Re: [External] : Re: parallel forwarding vs. bridges

2021-06-22 Thread Alexandr Nedvedicky
Hello,



the new diff still looks good to me. I could only catch a couple of
typos in the comments.

> 
> I've been running versions of this diff in production at work, and have
> hit a few panics and asserts. All the issues we've hit should be
> addressed in this diff.
> 
> The first issue was that pfsync could be in the process of sending a
> deferred packet while it's being removed from the state tree. Part of
> that removal process is stripping the state keys from the state, which
> pfsync uses to determine the address family so it knows which ip output
> routine to use. The quick and dirty fix to this is to have pfsync check
> the timeout state to see whether the state is unlinked or not. This
> currently relies on pfsync undefer and pf being serialised by the NET_LOCK.
> 
> The second is that the timeout member on a state can change while the
> purge task is looking at it. We hit this assert in pf_state_expires:
> 
>   KASSERT(state->timeout != PFTM_UNLINKED);
> 
> pf_state_expires was called from the purge code like this:
> 
>   if ((cur->timeout == PFTM_UNLINKED) ||
>   (pf_state_expires(cur) <= getuptime()))
>   SLIST_INSERT_HEAD(&gcl, cur, gc_list);
> 

I see. I missed those in my earlier review. the pf(4) we have
in tree uses the PF_STATE_ lock to keep cur->timeout consistent.
pf_purge_expired_states() executes the check under PF_STATE_ENTER_READ().

> 
> With my new locking scheme here, the state purge code is called without
> any of the locks that would serialise access to the state->timeout
> variable. I think I found a solution to this without having to
> reintroduce extra locking, which should allow us to keep the purge scan
> running concurrently with pf actually handling packets.

I think it should work, because we make sure pf_remove_state()
gets called only once.
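
to make the pattern concrete, here is a minimal sketch of the caller
side (the pf_state_expires() prototype matches the diff below; the
shape of the purge loop itself is my assumption):

	uint8_t timeout;
	struct pf_state *st;

	/* read state->timeout once; use the snapshot for both checks */
	timeout = cur->timeout;

	if (timeout == PFTM_UNLINKED ||
	    pf_state_expires(cur, timeout) <= getuptime()) {
		st = pf_state_ref(cur);	/* take the ref outside the macro */
		SLIST_INSERT_HEAD(&gcl, st, gc_list);
	}

taking the reference outside SLIST_INSERT_HEAD() also avoids the macro
double-expansion that produced the extra reference earlier in this thread.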

I've found two typos in the comments, see below. otherwise looks good.

OK sashan
> + /*
> +  * pf_state_expires is used by the state purge task to
> +  * decide if a state is a candidate for cleanup, and by the
> +  * pfsync state export code to populate an expiry time.
> +  *
> +  * this function may be called by the state purge task while
> +  * the state is being modified. avoid inconsistent reads of
> +  * state->timeout by having the caller do the read (and any
> +  * chacks it needs to do on the same variable) and then pass
s/chacks/checks
> +  * their view of the timeout in here for this function to use.
> +  * the only consequnce of using a stale timeout value is
s/consequnce/consequence

OK sashan



Re: [External] : Re: parallel forwarding vs. bridges

2021-06-21 Thread David Gwynne
On Wed, Jun 16, 2021 at 02:59:19AM +0200, Alexandr Nedvedicky wrote:
> Hello,
> 
> 
> > 
> > as above, copyout with a sleeping lock is fine.
> > 
> > the whole point of my change is to give us the ability to lock in the
> > forwarding path separately to locking in the ioctl path. half of that is
> > so we can copyout safely. the other half is to avoid letting the ioctl
> > path block packet processing if we can avoid it as an alternative to
> > having the network stack having to yield the cpu.
> 
> I see. my confusion came from the fact I'd forgotten the pflock got turned
> into a mutex when we saw the crash.
> 
> > 
> > > let's take a look at this part of pf_purge_expired_states()
> > > from your diff:
> > > 
> > > 1543 NET_LOCK();
> > > 1544 rw_enter_write(&pf_state_list.pfs_rwl);
> > > 1545 PF_LOCK();
> > > 1546 PF_STATE_ENTER_WRITE();
> > > 1547 SLIST_FOREACH(st, &gcl, gc_list) {
> > > 1548 if (st->timeout != PFTM_UNLINKED)
> > > 1549 pf_remove_state(st);
> > > 1550 
> > > 1551 pf_free_state(st);
> > > 1552 }
> > > 1553 PF_STATE_EXIT_WRITE();
> > > 1554 PF_UNLOCK();
> > > 1555 rw_exit_write(&pf_state_list.pfs_rwl);
> > > 
> > > at line 1543 we grab NET_LOCK(), at line 1544 we are trying
> > > to grab new lock (pf_state_list.pfs_rwl) exclusively. 
> > > 
> > > with your change we might be running into situation, where we do 
> > > copyout() as a
> > > reader on pf_state_list.pfs_rwl. Then we grab NET_LOCK() and attempt to 
> > > acquire
> > > pf_state_list.pfs_rwl exclusively, which is still occupied by guy, who 
> > > might be
> > > doing uvm_fault() in copyout(9f).
> > > 
> > > I'm just worried we may be trading one bug for another bug.  may be my 
> > > concern
> > > is just a false alarm here. I don't know.
> > 
> > no, these things are all worth discussing.
> > 
> > it's definitely possible there's bugs in here, but im pretty confident
> > it's not the copyout one.
> > 
> 
> it seems to work. I'm running your diff with bluhm's parallel diff
> and doing occasional pfctl -Fs / pfctl -ss under load. so far so good.
> 
> 
> 
> > > I guess 'pfgpurge_expired_fragment(s);' is unintentional change, 
> > > right?
> > 
> > yeah, i dont know how i did that. vi is hard?
> 
> sure it is...
> thanks for fixing the nits.
> 
> 
> 
> > Index: if_pfsync.c
> > ===
> > RCS file: /cvs/src/sys/net/if_pfsync.c,v
> > retrieving revision 1.292
> > diff -u -p -r1.292 if_pfsync.c
--- if_pfsync.c 15 Jun 2021 10:10:22 -0000  1.292
+++ if_pfsync.c 15 Jun 2021 11:21:20 -0000
> > @@ -2545,22 +2545,34 @@ pfsync_bulk_start(void)
> >  {
> > struct pfsync_softc *sc = pfsyncif;
> 
> I have not spotted anything suspicious in if_pfsync.c
> 
> the new diff reads fine to me.
> 
> OK sashan

I've been running versions of this diff in production at work, and have
hit a few panics and asserts. All the issues we've hit should be
addressed in this diff.

The first issue was that pfsync could be in the process of sending a
deferred packet while it's being removed from the state tree. Part of
that removal process is stripping the state keys from the state, which
pfsync uses to determine the address family so it knows which ip output
routine to use. The quick and dirty fix to this is to have pfsync check
the timeout state to see whether the state is unlinked or not. This
currently relies on pfsync undefer and pf being serialised by the NET_LOCK.
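
as a sketch (pd_st/pd_m and the key[] layout match the code quoted
elsewhere in this thread; the surrounding function shape is an
assumption), the quick fix amounts to bailing out before the state
keys are needed:

	struct pf_state *st = pd->pd_st;

	/*
	 * an unlinked state has had its state keys stripped, so the
	 * address family can no longer be determined. pfsync undefer
	 * and pf are serialised by the NET_LOCK, so this read is stable.
	 */
	if (st->timeout == PFTM_UNLINKED) {
		m_freem(pd->pd_m);
		return;
	}

	switch (st->key[PF_SK_WIRE]->af) {
	case AF_INET:
		/* ... hand pd->pd_m to ip_output() ... */
		break;
	case AF_INET6:
		/* ... hand pd->pd_m to ip6_output() ... */
		break;
	}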

The second is that the timeout member on a state can change while the
purge task is looking at it. We hit this assert in pf_state_expires:

KASSERT(state->timeout != PFTM_UNLINKED);

pf_state_expires was called from the purge code like this:

if ((cur->timeout == PFTM_UNLINKED) ||
(pf_state_expires(cur) <= getuptime()))
SLIST_INSERT_HEAD(&gcl, cur, gc_list);


With my new locking scheme here, the state purge code is called without
any of the locks that would serialise access to the state->timeout
variable. I think I found a solution to this without having to
reintroduce extra locking, which should allow us to keep the purge scan
running concurrently with pf actually handling packets.

Index: pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1118
diff -u -p -r1.1118 pf.c
--- pf.c  1 Jun 2021 09:57:11 -0000  1.1118
+++ pf.c  22 Jun 2021 01:34:36 -0000
@@ -259,6 +259,7 @@ void pf_state_key_link_inpcb(struct p
 voidpf_state_key_unlink_inpcb(struct pf_state_key *);
 voidpf_inpcb_unlink_state_key(struct inpcb *);
 voidpf_pktenqueue_delayed(void *);
+int32_t pf_state_expires(const struct pf_state *, uint8_t);
 
 #if NPFLOG > 0
 void   

Re: [External] : Re: parallel forwarding vs. bridges

2021-06-16 Thread Alexandr Nedvedicky
Hello,

re-sending the same diff updated to current if_pfsync.c.

the diff avoids a recursion on PF_LOCK(), which is indicated
by the panic stack here:


> > production firewalls panicked with the pf lock trying to lock against
> > itself a couple of nights ago:
> > 
> > db_enter() at db_enter+0x5
> > panic(81d47212) at panic+0x12a
> > rw_enter(82060e70,1) at rw_enter+0x261
> > pf_test(18,2,8158f000,800024d03db8) at pf_test+0x118c
> > ip6_output(fd80049a3a00,0,0,0,800024d03eb0,0) at
> > ip6_output+0xd33
> > nd6_ns_output(8158f000,0,800024d04228,fd8279b3b420,0) at
> > nd6_ns
> > _output+0x3e2
> > nd6_resolve(8158f000,fd816c6b2ae0,fd80659d8300,800024d04220
> > ,800024d040d8) at nd6_resolve+0x29d
> > ether_resolve(8158f000,fd80659d8300,800024d04220,fd816c6b2a
> > e0,800024d040d8) at ether_resolve+0x127
> > ether_output(8158f000,fd80659d8300,800024d04220,fd816c6b2ae
> > 0) at ether_output+0x2a
> > ip6_output(fd80659d8300,0,0,0,0,0) at ip6_output+0x1180
> > pfsync_undefer_notify(fd841dbec7b8) at pfsync_undefer_notify+0xac
> > pfsync_undefer(fd841dbec7b8,0) at pfsync_undefer+0x8d
> > pfsync_defer(fd82303cb310,fd8065a2) at pfsync_defer+0xfe
> > pf_test_rule(800024d04600,800024d045e8,800024d045e0,800024d045f
> > 0,800024d045d8,800024d045fe) at pf_test_rule+0x693
> > pf_test(2,3,815a9000,800024d04798) at pf_test+0x10f1
> > ip_output(fd8065a2,0,800024d04850,1,0,0) at ip_output+0x829
> > ip_forward(fd8065a2,81541800,fd817a7722d0,0) at
> > ip_forward+
> > 0x27a  
> > ip_input_if(800024d04a80,800024d04a7c,4,0,81541800) at
> > ip_input
> > _if+0x5fd
> > ipv4_input(81541800,fd8065a2) at ipv4_input+0x37
> > carp_input(8158f000,fd8065a2,5e000158) at
> > carp_input+0x1ac
> > ether_input(8158f000,fd8065a2) at ether_input+0x1c0
> > vlan_input(8152f000,fd8065a2) at vlan_input+0x19a
> > ether_input(8152f000,fd8065a2) at ether_input+0x76
> > if_input_process(801df048,800024d04ca8) at
> > if_input_process+0x5a
> > ifiq_process(801dbe00) at ifiq_process+0x6f
> > taskq_thread(8002b080) at taskq_thread+0x7b
> > end trace frame: 0x0, count: -26
> > 
> 

the recursion on PF_LOCK() is avoided by postponing the call to
pfsync_undefer() to the very last phase of pf_test(), where we run
without holding PF_LOCK().

I believe we will be able to revert the diff below once the scope of PF_LOCK()
in pf_test() is reduced in the future. However, that future is not here yet. To
reduce the PF_LOCK() scope in pf_test() we must give some love to pf's tables
first.
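
the caller side implied by the diff looks roughly like this (a sketch;
the exact call sites in pf_test()/pf_test_rule() are left out):

	struct pfsync_deferral *pd = NULL;

	PF_LOCK();
	/* ... pf_test_rule() may call pfsync_defer(st, m, &pd) ... */
	PF_UNLOCK();

	/*
	 * dispatch the old deferral only now, without PF_LOCK;
	 * pfsync_undefer() may re-enter pf_test() via ip_output()
	 * without recursing on the lock.
	 */
	if (pd != NULL)
		pfsync_undefer(pd, 0);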

comments? OKs?

thanks and
regards
sashan

8<---8<---8<--8<

diff --git a/sys/net/if_pfsync.c b/sys/net/if_pfsync.c
index f1d292d67ff..9e7dc5064b9 100644
--- a/sys/net/if_pfsync.c
+++ b/sys/net/if_pfsync.c
@@ -1931,7 +1931,7 @@ pfsync_insert_state(struct pf_state *st)
 }
 
 int
-pfsync_defer(struct pf_state *st, struct mbuf *m)
+pfsync_defer(struct pf_state *st, struct mbuf *m, struct pfsync_deferral **ppd)
 {
struct pfsync_softc *sc = pfsyncif;
struct pfsync_deferral *pd;
@@ -1944,21 +1944,30 @@ pfsync_defer(struct pf_state *st, struct mbuf *m)
m->m_flags & (M_BCAST|M_MCAST))
return (0);
 
+   pd = pool_get(&sc->sc_pool, M_NOWAIT);
+   if (pd == NULL)
+   return (0);
+
+   /*
+* the deferral queue grows faster than the timeout can consume
+* it, so we have to ask the packet (caller) to help the timer
+* and dispatch one deferral for us.
+*
+* We wish to call pfsync_undefer() here. Unfortunately we can't,
+* because pfsync_undefer() will be calling to ip_output(),
+* which in turn will call to pf_test(), which would then attempt
+* to grab PF_LOCK() we currently hold.
+*/
if (sc->sc_deferred >= 128) {
mtx_enter(&sc->sc_deferrals_mtx);
-   pd = TAILQ_FIRST(&sc->sc_deferrals);
-   if (pd != NULL) {
-   TAILQ_REMOVE(&sc->sc_deferrals, pd, pd_entry);
+   *ppd = TAILQ_FIRST(&sc->sc_deferrals);
+   if (*ppd != NULL) {
+   TAILQ_REMOVE(&sc->sc_deferrals, *ppd, pd_entry);
sc->sc_deferred--;
}
mtx_leave(&sc->sc_deferrals_mtx);
-   if (pd != NULL)
-   pfsync_undefer(pd, 0);
-   }
-
-   pd = pool_get(&sc->sc_pool, M_NOWAIT);
-   if (pd == NULL)
-   return (0);
+   } else
+   *ppd = NULL;
 
m->m_pkthdr.pf.flags |= PF_TAG_GENERATED;
SET(st->state_flags, PFSTATE_ACK);
diff --git a/sys/net/if_pfsync.h b/sys/net/if_pfsync.h
index 2520c87a36c..037219537e8 100644
--- a/sys/net/if_pfsync.h
+++ 

Re: [External] : Re: parallel forwarding vs. bridges

2021-06-15 Thread Alexandr Nedvedicky
Hello,


> 
> as above, copyout with a sleeping lock is fine.
> 
> the whole point of my change is to give us the ability to lock in the
> forwarding path separately to locking in the ioctl path. half of that is
> so we can copyout safely. the other half is to avoid letting the ioctl
> path block packet processing if we can avoid it as an alternative to
> having the network stack having to yield the cpu.

I see. my confusion came from the fact I'd forgotten the pflock got turned
into a mutex when we saw the crash.

> 
> > let's take a look at this part of pf_purge_expired_states()
> > from your diff:
> > 
> > 1543 NET_LOCK();
> > 1544 rw_enter_write(&pf_state_list.pfs_rwl);
> > 1545 PF_LOCK();
> > 1546 PF_STATE_ENTER_WRITE();
> > 1547 SLIST_FOREACH(st, &gcl, gc_list) {
> > 1548 if (st->timeout != PFTM_UNLINKED)
> > 1549 pf_remove_state(st);
> > 1550 
> > 1551 pf_free_state(st);
> > 1552 }
> > 1553 PF_STATE_EXIT_WRITE();
> > 1554 PF_UNLOCK();
> > 1555 rw_exit_write(&pf_state_list.pfs_rwl);
> > 
> > at line 1543 we grab NET_LOCK(), at line 1544 we are trying
> > to grab new lock (pf_state_list.pfs_rwl) exclusively. 
> > 
> > with your change we might be running into situation, where we do copyout() 
> > as a
> > reader on pf_state_list.pfs_rwl. Then we grab NET_LOCK() and attempt to 
> > acquire
> > pf_state_list.pfs_rwl exclusively, which is still occupied by guy, who 
> > might be
> > doing uvm_fault() in copyout(9f).
> > 
> > I'm just worried we may be trading one bug for another bug.  may be my 
> > concern
> > is just a false alarm here. I don't know.
> 
> no, these things are all worth discussing.
> 
> it's definitely possible there's bugs in here, but im pretty confident
> it's not the copyout one.
> 

it seems to work. I'm running your diff with bluhm's parallel diff
and doing occasional pfctl -Fs / pfctl -ss under load. so far so good.



> > I guess 'pfgpurge_expired_fragment(s);' is unintentional change, right?
> 
> yeah, i dont know how i did that. vi is hard?

sure it is...
thanks for fixing the nits.



> Index: if_pfsync.c
> ===
> RCS file: /cvs/src/sys/net/if_pfsync.c,v
> retrieving revision 1.292
> diff -u -p -r1.292 if_pfsync.c
> --- if_pfsync.c   15 Jun 2021 10:10:22 -0000  1.292
> +++ if_pfsync.c   15 Jun 2021 11:21:20 -0000
> @@ -2545,22 +2545,34 @@ pfsync_bulk_start(void)
>  {
>   struct pfsync_softc *sc = pfsyncif;

I have not spotted anything suspicious in if_pfsync.c

the new diff reads fine to me.

OK sashan



Re: [External] : Re: parallel forwarding vs. bridges

2021-06-15 Thread David Gwynne
On Tue, Jun 08, 2021 at 06:54:25PM +0200, Alexandr Nedvedicky wrote:
> Hello David,
> 
> I'm still not sure if your change is ultimate fix, or just significantly
> minimizes risk of the bug. If I understand things right, the problem we are
> trying to solve:

> DIOCGETSTATES we have in current, grabs NET_LOCK() and pf_state_lock
> as a reader.
> 
> it then walks through the whole state list and copies out (copyout(9f))
> data for each state into ioctl(2) buffer provided by calling process.

yep. it's not an ultimate fix. its main goal is to allow the ioctl
side of things to read the states safely without blocking the packet
processing side of things.

> we may trip assert down in copyout(9f):
> > panic: acquiring blockable sleep lock with spinlock or critical section
> > held (rwlock) vmmaplk

this is about taking an rwlock (vmmaplk) while holding a mutex
(pflock). you can copyout while holding sleeping locks (eg, NET_LOCK
or the rwlock version of the pflocks).

the pfs_rwl is a sleeping lock, so it's safe to copyout while holding
it. that's why this diff works.

> > Stopped at  db_enter+0x10:  popq%rbp
> > TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
> > *512895  28841  0 0x3  03K pfctl
> > db_enter() at db_enter+0x10
> > panic(81e19411) at panic+0x12a
> > witness_checkorder(fd83b09b4d18,1,0) at witness_checkorder+0xbce
> > rw_enter_read(fd83b09b4d08) at rw_enter_read+0x38
> > uvmfault_lookup(8000238e3418,0) at uvmfault_lookup+0x8a
> > uvm_fault_check(8000238e3418,8000238e3450,8000238e3478) at
> > uvm_fault_check+0x32
> > uvm_fault(fd83b09b4d00,e36553c000,0,2) at uvm_fault+0xfc
> > kpageflttrap(8000238e3590,e36553c000) at kpageflttrap+0x131
> > kerntrap(8000238e3590) at kerntrap+0x91
> > alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
> > copyout() at copyout+0x53
> 
> 
> I'm just afraid that although your change significantly reduces the risk we
> will die with similar call stack as the one above, the new code is not bullet
> proof. We still do copyout() while holding pf_state_list.pfs_rwl as a reader
> (in pf_states_get() from your diff). I agree packets do not grab
> pf_state_list.pfs_rwl at all. Your fix solves this problem in this respect.

as above, copyout with a sleeping lock is fine.

the whole point of my change is to give us the ability to lock in the
forwarding path separately to locking in the ioctl path. half of that is
so we can copyout safely. the other half is to avoid letting the ioctl
path block packet processing if we can avoid it as an alternative to
having the network stack having to yield the cpu.

> let's take a look at this part of pf_purge_expired_states()
> from your diff:
> 
> 1543 NET_LOCK();
> 1544 rw_enter_write(&pf_state_list.pfs_rwl);
> 1545 PF_LOCK();
> 1546 PF_STATE_ENTER_WRITE();
> 1547 SLIST_FOREACH(st, &gcl, gc_list) {
> 1548 if (st->timeout != PFTM_UNLINKED)
> 1549 pf_remove_state(st);
> 1550 
> 1551 pf_free_state(st);
> 1552 }
> 1553 PF_STATE_EXIT_WRITE();
> 1554 PF_UNLOCK();
> 1555 rw_exit_write(&pf_state_list.pfs_rwl);
> 
> at line 1543 we grab NET_LOCK(), at line 1544 we are trying
> to grab new lock (pf_state_list.pfs_rwl) exclusively. 
> 
> with your change we might be running into situation, where we do copyout() as 
> a
> reader on pf_state_list.pfs_rwl. Then we grab NET_LOCK() and attempt to 
> acquire
> pf_state_list.pfs_rwl exclusively, which is still occupied by guy, who might 
> be
> doing uvm_fault() in copyout(9f).
> 
> I'm just worried we may be trading one bug for another bug.  may be my concern
> is just a false alarm here. I don't know.

no, these things are all worth discussing.

it's definitely possible there's bugs in here, but im pretty confident
it's not the copyout one.

> anyway there are few more nits in your diff.

yep.

> 
> 
> > Index: pf.c
> > ===
> > RCS file: /cvs/src/sys/net/pf.c,v
> > retrieving revision 1.1118
> > diff -u -p -r1.1118 pf.c
> > --- pf.c  1 Jun 2021 09:57:11 -0000  1.1118
> > +++ pf.c  3 Jun 2021 06:24:48 -0000
> > @@ -1247,7 +1278,8 @@ pf_purge_expired_rules(void)
> >  void
> >  pf_purge_timeout(void *unused)
> >  {
> > -   task_add(net_tq(0), &pf_purge_task);
> > +   /* XXX move to systqmp to avoid KERNEL_LOCK */
> > +   task_add(systq, &pf_purge_task);
> >  }
> 
> I would just clean up the comment. looks like we should be
> able to get pf's ioctl operations  out of KERNEL_LOCK completely.
> I'll take a further look at it, while be working in pf_ioctl.c

this isn't the ioctl side though, this is the periodic gc task. if the pf
locking is right then we could move the task now.
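
the move itself would be a one-liner, something like this (a sketch;
whether the locking is actually ready for it is the open question):

	void
	pf_purge_timeout(void *unused)
	{
		/* systqmp runs its tasks without the KERNEL_LOCK */
		task_add(systqmp, &pf_purge_task);
	}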

> 
> > @@ -1280,11 +1311,10 @@ pf_purge(void *xnloops)
> >  * Fragments don't require PF_LOCK(), they use their own lock.
> >   

Re: [External] : Re: parallel forwarding vs. bridges

2021-06-14 Thread Alexandr Nedvedicky
Hello,


> > it seems to me we may leak `pd` we took at line 1944. The leak
> > happens in case timeout_del() returns zero.
> > 
> > diff below ignores the return value from timeout_del() and makes
> > sure we always call pfsync_undefer()
> > 
> > I have not seen such panic in my test environment. Would you be able
> > to give my diff a try?
> 
> i hit a different fault in this code recently, so i've been looking
> at the code again. my fix for the second fault will conflict with this
> one. i'll send it out to tech@ separately though.
> 

I've just sent my OK to your diff.

> your diff looks like it moves the handling of undefer until after pf
> drops its locks. i was just going to make pfsync schedule a task or
> softint immediately if the queue was getting long, and then stop doing
> deferrals if it was way too long.

that's also an alternative I was considering. I think I still have it
somewhere in my history. Both approaches are fine, I think. Scheduling
a task looks less ugly. However, as I was typing the code I started
to realize it might be better to let the packet dispatch the deferral,
because this will share resources in a more natural way.

from my point of view the story goes like this:

the timeout does not dispatch deferrals fast enough, so the queue grows

a packet needs to insert a new deferral into the queue

the new deferral does not fit into the queue; the options are:

remove one deferral from the queue to make space for the
new one. the packet then must dispatch the old deferral.
this conserves resources because the packet (network subsystem)
pays the cost (memory & cpu) for deferral processing right away.

schedule yet another task to dispatch the deferral. my concern
here is that it allows the network subsystem to allocate
more and more resources. if a deferral does not fit
into the regular timeout queue, we just let pfsync create
an 'express task' to dispatch the old deferral.


I see both those changes as an 'interim fix'. We may be able to
restore the original code we currently have in pfsync_defer() as soon
as we shrink the scope of PF_LOCK. Once PF_LOCK is
in better shape, we will be able to make pfsync_defer() nice
and simple again.


thanks and
regards
sashan



Re: [External] : Re: parallel forwarding vs. bridges

2021-06-13 Thread David Gwynne
On Sat, Jun 05, 2021 at 01:09:01PM +0200, Alexandr Nedvedicky wrote:
> Hello David,
> 
> 
> 
> > the scope of the pf locks likely needs reduction anyway. one of my
> 
> I agree. pf_lock covers too much in PF currently. it protects,
> all rules, tables and fragment caches.
> 
> > production firewalls panicked with the pf lock trying to lock against
> > itself a couple of nights ago:
> > 
> > db_enter() at db_enter+0x5
> > panic(81d47212) at panic+0x12a
> > rw_enter(82060e70,1) at rw_enter+0x261
> > pf_test(18,2,8158f000,800024d03db8) at pf_test+0x118c
> > ip6_output(fd80049a3a00,0,0,0,800024d03eb0,0) at
> > ip6_output+0xd33
> > nd6_ns_output(8158f000,0,800024d04228,fd8279b3b420,0) at
> > nd6_ns
> > _output+0x3e2
> > nd6_resolve(8158f000,fd816c6b2ae0,fd80659d8300,800024d04220
> > ,800024d040d8) at nd6_resolve+0x29d
> > ether_resolve(8158f000,fd80659d8300,800024d04220,fd816c6b2a
> > e0,800024d040d8) at ether_resolve+0x127
> > ether_output(8158f000,fd80659d8300,800024d04220,fd816c6b2ae
> > 0) at ether_output+0x2a
> > ip6_output(fd80659d8300,0,0,0,0,0) at ip6_output+0x1180
> > pfsync_undefer_notify(fd841dbec7b8) at pfsync_undefer_notify+0xac
> > pfsync_undefer(fd841dbec7b8,0) at pfsync_undefer+0x8d
> > pfsync_defer(fd82303cb310,fd8065a2) at pfsync_defer+0xfe
> > pf_test_rule(800024d04600,800024d045e8,800024d045e0,800024d045f
> > 0,800024d045d8,800024d045fe) at pf_test_rule+0x693
> > pf_test(2,3,815a9000,800024d04798) at pf_test+0x10f1
> > ip_output(fd8065a2,0,800024d04850,1,0,0) at ip_output+0x829
> > ip_forward(fd8065a2,81541800,fd817a7722d0,0) at
> > ip_forward+
> > 0x27a  
> > ip_input_if(800024d04a80,800024d04a7c,4,0,81541800) at
> > ip_input
> > _if+0x5fd
> > ipv4_input(81541800,fd8065a2) at ipv4_input+0x37
> > carp_input(8158f000,fd8065a2,5e000158) at
> > carp_input+0x1ac
> > ether_input(8158f000,fd8065a2) at ether_input+0x1c0
> > vlan_input(8152f000,fd8065a2) at vlan_input+0x19a
> > ether_input(8152f000,fd8065a2) at ether_input+0x76
> > if_input_process(801df048,800024d04ca8) at
> > if_input_process+0x5a
> > ifiq_process(801dbe00) at ifiq_process+0x6f
> > taskq_thread(8002b080) at taskq_thread+0x7b
> > end trace frame: 0x0, count: -26
> > 
> 
> diff below postpones call to pfsync_undefer() to moment, when PF_LOCK() is
> released. I believe this should fix the panic you see. the change does not
> look nice. and can be later reverted, when pf_lock scope will be reduced.
> 
> I took a closer look at pfsync_defer() here:
> 
> 1941 
> 1942 if (sc->sc_deferred >= 128) {
> 1943 mtx_enter(&sc->sc_deferrals_mtx);
> 1944 pd = TAILQ_FIRST(&sc->sc_deferrals);
> 1945 TAILQ_REMOVE(&sc->sc_deferrals, pd, pd_entry);
> 1946 sc->sc_deferred--;
> 1947 mtx_leave(&sc->sc_deferrals_mtx);
> 1948 if (timeout_del(&pd->pd_tmo))
> 1949 pfsync_undefer(pd, 0);
> 1950 }
> 1951 
> 1952 pd = pool_get(&sc->sc_pool, M_NOWAIT);
> 1953 if (pd == NULL)
> 1954 return (0);
> 
> it seems to me we may leak `pd` we took at line 1944. The leak
> happens in case timeout_del() returns zero.
> 
> diff below ignores the return value from timeout_del() and makes
> sure we always call pfsync_undefer()
> 
> I have not seen such panic in my test environment. Would you be able
> to give my diff a try?

i hit a different fault in this code recently, so i've been looking
at the code again. my fix for the second fault will conflict with this
one. i'll send it out to tech@ separately though.

your diff looks like it moves the handling of undefer until after pf
drops its locks. i was just going to make pfsync schedule a task or
softint immediately if the queue was getting long, and then stop doing
deferrals if it was way too long.

> 8<---8<---8<--8<
> diff --git a/sys/net/if_pfsync.c b/sys/net/if_pfsync.c
> index b90aa934de4..a5e5e1ae0f3 100644
> --- a/sys/net/if_pfsync.c
> +++ b/sys/net/if_pfsync.c
> @@ -272,7 +272,6 @@ void  pfsync_syncdev_state(void *);
>  void pfsync_ifdetach(void *);
>  
>  void pfsync_deferred(struct pf_state *, int);
> -void pfsync_undefer(struct pfsync_deferral *, int);
>  void pfsync_defer_tmo(void *);
>  
>  void pfsync_cancel_full_update(struct pfsync_softc *);
> @@ -1927,7 +1926,7 @@ pfsync_insert_state(struct pf_state *st)
>  }
>  
>  int
> -pfsync_defer(struct pf_state *st, struct mbuf *m)
> +pfsync_defer(struct pf_state *st, struct mbuf *m, struct pfsync_deferral 
> **ppd)
>  {
>   struct pfsync_softc *sc = pfsyncif;
>   struct 

Re: [External] : Re: parallel forwarding vs. bridges

2021-06-08 Thread Alexandr Nedvedicky
Hello David,

I'm still not sure if your change is the ultimate fix, or just significantly
minimizes the risk of the bug. If I understand things right, the problem we are
trying to solve is:

the DIOCGETSTATES we have in current grabs the NET_LOCK() and pf_state_lock
as a reader.

it then walks through the whole state list and copies out (copyout(9f))
data for each state into the ioctl(2) buffer provided by the calling process.

we may trip the assert down in copyout(9f):
> panic: acquiring blockable sleep lock with spinlock or critical section
> held (rwlock) vmmaplk
> Stopped at  db_enter+0x10:  popq%rbp
> TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
> *512895  28841  0 0x3  03K pfctl
> db_enter() at db_enter+0x10
> panic(81e19411) at panic+0x12a
> witness_checkorder(fd83b09b4d18,1,0) at witness_checkorder+0xbce
> rw_enter_read(fd83b09b4d08) at rw_enter_read+0x38
> uvmfault_lookup(8000238e3418,0) at uvmfault_lookup+0x8a
> uvm_fault_check(8000238e3418,8000238e3450,8000238e3478) at
> uvm_fault_check+0x32
> uvm_fault(fd83b09b4d00,e36553c000,0,2) at uvm_fault+0xfc
> kpageflttrap(8000238e3590,e36553c000) at kpageflttrap+0x131
> kerntrap(8000238e3590) at kerntrap+0x91
> alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
> copyout() at copyout+0x53


I'm just afraid that although your change significantly reduces the risk that we
will die with a call stack similar to the one above, the new code is not bullet
proof. We still do copyout() while holding pf_state_list.pfs_rwl as a reader
(in pf_states_get() from your diff). I agree packets do not grab
pf_state_list.pfs_rwl at all. Your fix solves this problem in this respect.

let's take a look at this part of pf_purge_expired_states()
from your diff:

1543 NET_LOCK();
1544 rw_enter_write(&pf_state_list.pfs_rwl);
1545 PF_LOCK();
1546 PF_STATE_ENTER_WRITE();
1547 SLIST_FOREACH(st, &gcl, gc_list) {
1548 if (st->timeout != PFTM_UNLINKED)
1549 pf_remove_state(st);
1550 
1551 pf_free_state(st);
1552 }
1553 PF_STATE_EXIT_WRITE();
1554 PF_UNLOCK();
1555 rw_exit_write(&pf_state_list.pfs_rwl);

at line 1543 we grab NET_LOCK(), at line 1544 we are trying
to grab the new lock (pf_state_list.pfs_rwl) exclusively.

with your change we might be running into a situation where we do copyout() as
a reader on pf_state_list.pfs_rwl. Then we grab NET_LOCK() and attempt to
acquire pf_state_list.pfs_rwl exclusively, while it is still occupied by a
reader who might be doing uvm_fault() in copyout(9f).

I'm just worried we may be trading one bug for another bug. Maybe my concern
is just a false alarm here. I don't know.
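
spelled out as an interleaving, the scenario looks like this (a sketch
of the concern, not actual code):

	/*
	 * thread A (DIOCGETSTATES):     thread B (purge task):
	 *
	 * rw_enter_read(&pfs_rwl);
	 *                               NET_LOCK();
	 * copyout(...);                 rw_enter_write(&pfs_rwl);
	 *   -> uvm_fault(), may sleep     -> blocks until A is done
	 *
	 * B holds the NET_LOCK while it waits; if A's fault path ever
	 * needed the NET_LOCK to make progress, this would deadlock.
	 */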

anyway, there are a few more nits in your diff.



> Index: pf.c
> ===
> RCS file: /cvs/src/sys/net/pf.c,v
> retrieving revision 1.1118
> diff -u -p -r1.1118 pf.c
> --- pf.c  1 Jun 2021 09:57:11 -0000  1.1118
> +++ pf.c  3 Jun 2021 06:24:48 -0000
> @@ -1247,7 +1278,8 @@ pf_purge_expired_rules(void)
>  void
>  pf_purge_timeout(void *unused)
>  {
> - task_add(net_tq(0), &pf_purge_task);
> + /* XXX move to systqmp to avoid KERNEL_LOCK */
> + task_add(systq, &pf_purge_task);
>  }

I would just clean up the comment. looks like we should be
able to get pf's ioctl operations out of the KERNEL_LOCK completely.
I'll take a further look at it while working on pf_ioctl.c
  

> @@ -1280,11 +1311,10 @@ pf_purge(void *xnloops)
>* Fragments don't require PF_LOCK(), they use their own lock.
>*/
>   if ((*nloops) >= pf_default_rule.timeout[PFTM_INTERVAL]) {
> - pf_purge_expired_fragments();
> + pfgpurge_expired_fragment(s);
>   *nloops = 0;
>   }
>   NET_UNLOCK();
> - KERNEL_UNLOCK();
>  
>   timeout_add_sec(&pf_purge_to, 1);
>  }

I guess 'pfgpurge_expired_fragment(s);' is an unintentional change, right?


> Index: pfvar_priv.h
> ===
> RCS file: /cvs/src/sys/net/pfvar_priv.h,v
> retrieving revision 1.6
> diff -u -p -r1.6 pfvar_priv.h
> --- pfvar_priv.h  9 Feb 2021 14:06:19 -   1.6
> +++ pfvar_priv.h  3 Jun 2021 06:24:48 -
> @@ -38,6 +38,115 @@
>  #ifdef _KERNEL
>  
>  #include 
> +#include 
> +
> +/*
> +
> +states are linked into a global list to support the following
> +functionality:
> +
> + - garbage collection
> + - pfsync bulk send operations
> + - bulk state fetches via the DIOCGETSTATES ioctl
> + - bulk state clearing via the DIOCCLRSTATES ioctl
> +
> +states are inserted into the global pf_state_list once it has also
> +been successfully added to the various trees that make up the state
> +table. states are only removed from the pf_state_list by the garbage
> +collection process.
> +

the multiline comment does not match style, I think.

I kind of 

Re: [External] : Re: parallel forwarding vs. bridges

2021-06-07 Thread Hrvoje Popovski
On 7.6.2021. 9:25, Alexandr Nedvedicky wrote:
> Hello,
> 
> On Sun, Jun 06, 2021 at 09:54:50PM +0200, Hrvoje Popovski wrote:
> 
>>
>> this one?
> yes, exactly this one.

Ok, great ..

this and similar panics are products of accidental stp loops. I have two
identical boxes with the same network setup connected to the same switch. At
one point i tested bluhm@ and sashan@ diffs and forgot that i had
tpmr(4) or veb(4) configured on both boxes, which is a loop ...

So now, when i understand what i've done, i can easily reproduce the panic
with your and dlg@'s diffs ... i just start to send traffic over one box
and the other box panics :)

all those panics are without pf and with tpmr or veb

here's few panics .. i can send details if you find some panics
interesting ..



login: papnapiniacc:n: i kc e: r  n e   l   di  a  g n  o s  ti
 ca s  s  e
 r  t i o  n   "  r v  e b  e   = =N  UL L  "f  a  i
   l e  d  :f
i  l e"
p /okseyoslr/_ndenole t_ /piu ft : _  e t  h e  rb r  i dg  e  .
c  " , li
  n e22  6e  b t_  r e p  la  c ee  b   0  x
f  f f  f  8

 0e0bd0eipa0lg0n6:8 o2 s te i6 c8n e  b  e   0 x  f  f f  ff
d  8 3  ab  7
 7  b  d  9 8   r v  e  b e   0  x f  f f  ff  d 8 3  a  b
  7  7  b e  1  0

 Stopped at  db_enter+0x10:  popq%rbp
TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
  91272  9  0 0x14000  0x2001  softnet
*410720  68055  0 0x14000  0x2000  softnet
 444532  44174  0 0x14000  0x2002  softnet
 149723   5636  0 0x14000  0x2004  softnet
db_enter() at db_enter+0x10
panic(81e85119) at panic+0xbf
etherbridge_map(80682e68,8016ad00,90e2ba33b4a0) at
etherbridge_map+0x37d
veb_port_input(80082048,fd80bcb8f600,90e2ba33b4a1,8016ad00)
 at veb_port_input+0x2fa
ether_input(80082048,fd80bcb8f600) at ether_input+0xf5
if_input_process(80082048,800023871988) at if_input_process+0x6f
ifiq_process(80080f00) at ifiq_process+0x69
taskq_thread(80030200) at taskq_thread+0x9f
end trace frame: 0x0, count: 7
https://www.openbsd.org/ddb.html describes the minimum info required in
bug reports.  Insufficient info makes it difficult to find and fix bugs.

ddb{0}> show panic
 cpu0: kernel diagnostic assertion "rvebe == NULL" failed: file
"/sys/net/if_etherbridge.c", line 226 ebt_replace eb 0x80682e68
nebe 0xfd83ab77bd98 rvebe 0xfd83ab77be10
 cpu2: pool_do_put: ebepl: double pool_put: 0xfd83ab77b8e8
*cpu1: kernel diagnostic assertion "smr->smr_func == NULL" failed: file
"/sys/kern/kern_smr.c", line 247
ddb{0}>




r620-1# papnainicc: :   k er n  eld ia g  n o st i  c  a s s e r ti
o n  " r e f c  n
t  ! =~0 "f a i le d  :   f il e" /s y s / k  e r
 n/kkeernrenl_ s yn c  h . c ",l i ne   8  2 6
 Stopped at  db_enter+0x10:  popq%rbp
TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
  73487  28509  0 0x14000  0x2001  softnet
  99803  26490  0 0x14000  0x2005  softnet
*372376  10060  0 0x14000  0x2004  softnet
 128194  68537  0 0x14000  0x2003  softnet
 105333  38464  0 0x14000 0x42000  softclock
db_enter() at db_enter+0x10
panic(81e07adc) at panic+0xbf
__assert(81e7558b,81de2e62,33a,81e13185) at
__assert+0x2b
refcnt_rele(fd839484c768) at refcnt_rele+0x6f
etherbridge_map(80682e68,812c6f00,90e2ba33b4a0) at
etherbridge_map+0x1ca
veb_port_input(80087048,fd80a6545200,90e2ba33b4a1,812c6f00)
 at veb_port_input+0x2fa
ether_input(80087048,fd80a6545200) at ether_input+0xf5
if_input_process(80087048,80002386bba8) at if_input_process+0x6f
ifiq_process(80086d00) at ifiq_process+0x69
taskq_thread(80030100) at taskq_thread+0x9f
end trace frame: 0x0, count: 5
https://www.openbsd.org/ddb.html describes the minimum info required in
bug reports.  Insufficient info makes it difficult to find and fix bugs.

ddb{4}> show panic
 cpu4: kernel diagnostic assertion "refcnt != ~0" failed: file
"/sys/kern/kern_synch.c", line 826
*cpu1: kernel diagnostic assertion "smr->smr_func == NULL" failed: file
"/sys/kern/kern_smr.c", line 247
ddb{4}>



r620-1# papppnpppannnanniiniiciicc::c:cc : :: poo
   l_d
  o _ge  t: mc l 2k
2:p
 a ge emp  ty
   Stopped at  db_enter+0x10:  popq%rbp
TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
* 13147  52870  0 0x14000  0x2002  softnet
db_enter() at db_enter+0x10
panic(81e0a795) at panic+0xbf
pool_do_get(821f6d98,2,800023864bb4) at pool_do_get+0x309
pool_get(821f6d98,2) at pool_get+0x9d
m_clget(0,2,802) at 

Re: [External] : Re: parallel forwarding vs. bridges

2021-06-07 Thread Alexandr Nedvedicky
Hello,

On Sun, Jun 06, 2021 at 09:54:50PM +0200, Hrvoje Popovski wrote:

> 
> 
> this one?

yes, exactly this one.


> 
> 
> ddb{5}> show panic
>  cpu5: kernel diagnostic assertion "rvebe == NULL" failed: file
> "/sys/net/if_etherbridge.c", line 226 ebt_replace eb 0x80682e68
> nebe 0xfd8391e91630 rvebe 0xfd8391e914c8
> *cpu2: kernel diagnostic assertion "smr->smr_func == NULL" failed: file
> "/sys/kern/kern_smr.c", line 247
> ddb{5}>

now I see things are actually more colorful, thanks to the work to make
panic handling mp-friendly.

In the earlier panic I could see 'smr->smr_func == NULL' was the only
assert on fire. With mp panic handling I see there are actually
more asserts on fire. This is neat.

it looks like packets are trying to ride through a port which is
just being destroyed.


thanks and
regards
sashan



Re: [External] : Re: parallel forwarding vs. bridges

2021-06-06 Thread Hrvoje Popovski
On 5.6.2021. 18:43, Alexandr Nedvedicky wrote:
> According to early tests it works well. Currently there is one
> mysterious panic, which Hrvoje may be able to comment on
> how to trigger it. The stack looks as follows:
> 
>   kernel diagnostic assertion "smr->smr_func == NULL" failed: file
>   "/sys/kern/kern_smr.c", line 247
>   panic() at panic+0x12a
>   __assert() at
>   __assert+0x2b
>   smr_call_impl() at smr_call_impl+0xd4
>   veb_port_input() at veb_port_input+0x2fa
>   ether_input() at ether_input+0x10b
>   if_input_process() at if_input_process+0x6f
>   ifiq_process() at ifiq_process+0x69
>   taskq_thread() at taskq_thread+0x81


this one?

r620-1# ppaanniicc: :  k e  r ne ld ia g n o s  t icas s e  r t
io n" rv e b e
= =  N U  LL "f ai l e d  :  f i  l e  " /  s y s/ n e
tk/eirnfe_el t he r b r  i d g e. c  " ,  l  i ne   2 2 6eb t  _ r
ep l  a ce   e  b
0 x f f  f f 80 0  0 00 6 8  2 e6 8n eb e   0  x f
 fdifaffgdno8s39t1iec 9 16 3 0r v eb e   0  x f ff f  f d 83 9 1  e
9 1 4c 8
 Stopped at  db_enter+0x10:  popq%rbp
TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
 218522  97075  0 0x14000  0x2001  softnet
*210501  76472  0 0x14000  0x2005  softnet
 192964556  0 0x14000  0x2002  softnet
 312196  98971  0 0x14000  0x2003  softnet
db_enter() at db_enter+0x10
panic(81e84b7f) at panic+0xbf
etherbridge_map(80682e68,8016a000,90e2ba33b4a0) at
etherbridge_map+0x37d
veb_port_input(80082048,fd80a6ec8800,90e2ba33b4a1,8016a000)
 at veb_port_input+0x2fa
ether_input(80082048,fd80a6ec8800) at ether_input+0xf5
if_input_process(80082048,800023871bf8) at if_input_process+0x6f
ifiq_process(80086300) at ifiq_process+0x69
taskq_thread(80030200) at taskq_thread+0x9f
end trace frame: 0x0, count: 7
https://www.openbsd.org/ddb.html describes the minimum info required in
bug reports.  Insufficient info makes it difficult to find and fix bugs.


ddb{5}> show panic
 cpu5: kernel diagnostic assertion "rvebe == NULL" failed: file
"/sys/net/if_etherbridge.c", line 226 ebt_replace eb 0x80682e68
nebe 0xfd8391e91630 rvebe 0xfd8391e914c8
*cpu2: kernel diagnostic assertion "smr->smr_func == NULL" failed: file
"/sys/kern/kern_smr.c", line 247
ddb{5}>



Re: [External] : Re: parallel forwarding vs. bridges

2021-06-05 Thread Alexandr Nedvedicky
Hello David,


> 
> my best argument for mutexes in pf is that currently we use smr critical
> sections around things like the bridge and aggr port selection, which
> are very obviously read-only workloads. pf may be a largely read-only
> workload, but where it is at the moment means it's not going to get run
> concurrently anyway.
> 

I'm glad you mention bridges and smr. with bluhm's parallel forwarding
diff [1] Hrvoje noticed panics with various bridges:

db_enter() at db_enter+0x10
panic() at panic+0x12a
__assert() at __assert+0x2b
assertwaitok() at assertwaitok+0xdc
mi_switch() at mi_switch+0x40
sleep_finish() at sleep_finish+0x11c
rw_enter() at rw_enter+0x229
pf_test() at pf_test+0x1055
tpmr_pf() at tpmr_pf+0x7e
tpmr_input() at tpmr_input+0x124
ether_input() at ether_input+0xf5
if_input_process() at if_input_process+0x6f
ifiq_process() at ifiq_process+0x69

things start to go downhill right here in ether_input():

 426 
 427 smr_read_enter();
 428 eb = SMR_PTR_GET(&ac->ac_brport);
 429 if (eb != NULL) {
 430 m = (*eb->eb_input)(ifp, m, dst, eb->eb_port);
 431 if (m == NULL) {
 432 smr_read_leave();
 433 return;
 434 }
 435 }
 436 smr_read_leave();
 437 

we panic because PF is going to sleep on one of its locks.
this is the case for 1 of 40 packets: the one which needs
to create state. ether_input() cannot expect that ->eb_input()
won't block. Maybe someday in the future, but we are not
there yet.

we don't see the panic in the current tree because of the exclusive
net_lock. ether_input() runs exclusively as a writer
on the netlock, thus all further locks down the stack are
grabbed without blocking.

My personal thought on this: a non-blocking ->eb_input() is a
pretty hard requirement. I don't mind if packet processing
gives up a CPU to wait for another packet to get its job
done. This gives another process a chance to use the cpu, be it
another packet or any process in the system. a lack of sleeping
points may lead to a less responsive system when the network
gets more busy.

the diff below combines smr_read... with reference counting using
traditional atomic ops. The idea is to adjust the code found in ether_input():

 426 
 427 smr_read_enter();
 428 eb = SMR_PTR_GET(&ac->ac_brport);
 429 if (eb != NULL)
 430 eb->eb_port_take(eb->eb_port);
 431 smr_read_leave();
 432 if (eb != NULL) {
 433 m = (*eb->eb_input)(ifp, m, dst, eb->eb_port);
 434 eb->eb_port_rele(eb->eb_port);
 435 if (m == NULL) {
 436 return;
 437 }
 438 }
 439 
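
a minimal sketch of what the veb(4) take/rele hooks could look like,
using the kernel refcnt API (the p_refs member name is my assumption):

	void
	veb_eb_port_take(void *port)
	{
		struct veb_port *p = port;

		refcnt_take(&p->p_refs);  /* pin the port across ->eb_input() */
	}

	void
	veb_eb_port_rele(void *port)
	{
		struct veb_port *p = port;

		/* wakes a detaching thread sleeping in refcnt_finalize() */
		refcnt_rele_wake(&p->p_refs);
	}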

According to early tests it works well. Currently there is one
mysterious panic, which Hrvoje may be able to comment on
how to trigger it. The stack looks as follows:

kernel diagnostic assertion "smr->smr_func == NULL" failed: file
"/sys/kern/kern_smr.c", line 247
panic() at panic+0x12a
__assert() at
__assert+0x2b
smr_call_impl() at smr_call_impl+0xd4
veb_port_input() at veb_port_input+0x2fa
ether_input() at ether_input+0x10b
if_input_process() at if_input_process+0x6f
ifiq_process() at ifiq_process+0x69
taskq_thread() at taskq_thread+0x81

It's not obvious from the stack above, but we die in etherbridge_map(),
which is being called from here in veb_port_input():

#if 0 && defined(IPSEC)
if (ISSET(ifp->if_flags, IFF_LINK2) &&
(m = veb_ipsec_in(ifp0, m)) == NULL)
return (NULL);
#endif

eh = mtod(m, struct ether_header *);

if (ISSET(p->p_bif_flags, IFBIF_LEARNING))
etherbridge_map(&sc->sc_eb, p, src);

CLR(m->m_flags, M_BCAST|M_MCAST);
SET(m->m_flags, M_PROTO1);

the panic itself must be triggered by 'ebe_rele(oebe);' found
at the end of etherbridge_map(). I've spent a few days with it,
and have no idea how it could happen. it looks like the result of
a memory corruption/use-after-free bug.

anyway, I would like to know if it would make sense to go with the diff
below. This will allow us to keep hunting for other bugs sitting
in the parallel forwarding path.

thanks and
sashan

[1] https://marc.info/?l=openbsd-tech&m=161903387904923&w=2 

8<---8<---8<--8<
diff --git a/sys/net/if_bridge.c b/sys/net/if_bridge.c
index 703be01648f..c41f3c93668 100644
--- a/sys/net/if_bridge.c
+++ b/sys/net/if_bridge.c
@@ -138,6 

Re: [External] : Re: parallel forwarding vs. bridges

2021-06-05 Thread Alexandr Nedvedicky
Hello David,



> the scope of the pf locks likely needs reduction anyway. one of my

I agree. pf_lock covers too much in PF currently. it protects
all rules, tables and fragment caches.

> production firewalls panicked with the pf lock trying to lock against
> itself a couple of nights ago:
> 
> db_enter() at db_enter+0x5
> panic(81d47212) at panic+0x12a
> rw_enter(82060e70,1) at rw_enter+0x261
> pf_test(18,2,8158f000,800024d03db8) at pf_test+0x118c
> ip6_output(fd80049a3a00,0,0,0,800024d03eb0,0) at
> ip6_output+0xd33
> nd6_ns_output(8158f000,0,800024d04228,fd8279b3b420,0) at
> nd6_ns
> _output+0x3e2
> nd6_resolve(8158f000,fd816c6b2ae0,fd80659d8300,800024d04220
> ,800024d040d8) at nd6_resolve+0x29d
> ether_resolve(8158f000,fd80659d8300,800024d04220,fd816c6b2a
> e0,800024d040d8) at ether_resolve+0x127
> ether_output(8158f000,fd80659d8300,800024d04220,fd816c6b2ae
> 0) at ether_output+0x2a
> ip6_output(fd80659d8300,0,0,0,0,0) at ip6_output+0x1180
> pfsync_undefer_notify(fd841dbec7b8) at pfsync_undefer_notify+0xac
> pfsync_undefer(fd841dbec7b8,0) at pfsync_undefer+0x8d
> pfsync_defer(fd82303cb310,fd8065a2) at pfsync_defer+0xfe
> pf_test_rule(800024d04600,800024d045e8,800024d045e0,800024d045f
> 0,800024d045d8,800024d045fe) at pf_test_rule+0x693
> pf_test(2,3,815a9000,800024d04798) at pf_test+0x10f1
> ip_output(fd8065a2,0,800024d04850,1,0,0) at ip_output+0x829
> ip_forward(fd8065a2,81541800,fd817a7722d0,0) at
> ip_forward+
> 0x27a  
> ip_input_if(800024d04a80,800024d04a7c,4,0,81541800) at
> ip_input
> _if+0x5fd
> ipv4_input(81541800,fd8065a2) at ipv4_input+0x37
> carp_input(8158f000,fd8065a2,5e000158) at
> carp_input+0x1ac
> ether_input(8158f000,fd8065a2) at ether_input+0x1c0
> vlan_input(8152f000,fd8065a2) at vlan_input+0x19a
> ether_input(8152f000,fd8065a2) at ether_input+0x76
> if_input_process(801df048,800024d04ca8) at
> if_input_process+0x5a
> ifiq_process(801dbe00) at ifiq_process+0x6f
> taskq_thread(8002b080) at taskq_thread+0x7b
> end trace frame: 0x0, count: -26
> 

the diff below postpones the call to pfsync_undefer() to the moment when
PF_LOCK() is released. I believe this should fix the panic you see. the change
does not look nice, and can be reverted later, when the pf_lock scope is
reduced.

I took a closer look at pfsync_defer() here:

1941 
1942 if (sc->sc_deferred >= 128) {
1943 mtx_enter(&sc->sc_deferrals_mtx);
1944 pd = TAILQ_FIRST(&sc->sc_deferrals);
1945 TAILQ_REMOVE(&sc->sc_deferrals, pd, pd_entry);
1946 sc->sc_deferred--;
1947 mtx_leave(&sc->sc_deferrals_mtx);
1948 if (timeout_del(&pd->pd_tmo))
1949 pfsync_undefer(pd, 0);
1950 }
1951 
1952 pd = pool_get(&sc->sc_pool, M_NOWAIT);
1953 if (pd == NULL)
1954 return (0);

it seems to me we may leak `pd` we took at line 1944. The leak
happens in case timeout_del() returns zero.

the diff below ignores the return value from timeout_del() and makes
sure we always call pfsync_undefer().
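
the intent, as a sketch (a NULL check is added for safety; the real
diff follows below):

	if (sc->sc_deferred >= 128) {
		mtx_enter(&sc->sc_deferrals_mtx);
		pd = TAILQ_FIRST(&sc->sc_deferrals);
		if (pd != NULL) {
			TAILQ_REMOVE(&sc->sc_deferrals, pd, pd_entry);
			sc->sc_deferred--;
		}
		mtx_leave(&sc->sc_deferrals_mtx);
		if (pd != NULL) {
			timeout_del(&pd->pd_tmo);  /* result ignored */
			pfsync_undefer(pd, 0);
		}
	}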

I have not seen such a panic in my test environment. Would you be able
to give my diff a try?

thanks a lot
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/if_pfsync.c b/sys/net/if_pfsync.c
index b90aa934de4..a5e5e1ae0f3 100644
--- a/sys/net/if_pfsync.c
+++ b/sys/net/if_pfsync.c
@@ -272,7 +272,6 @@ void pfsync_syncdev_state(void *);
 void   pfsync_ifdetach(void *);
 
 void   pfsync_deferred(struct pf_state *, int);
-void   pfsync_undefer(struct pfsync_deferral *, int);
 void   pfsync_defer_tmo(void *);
 
 void   pfsync_cancel_full_update(struct pfsync_softc *);
@@ -1927,7 +1926,7 @@ pfsync_insert_state(struct pf_state *st)
 }
 
 int
-pfsync_defer(struct pf_state *st, struct mbuf *m)
+pfsync_defer(struct pf_state *st, struct mbuf *m, struct pfsync_deferral **ppd)
 {
struct pfsync_softc *sc = pfsyncif;
struct pfsync_deferral *pd;
@@ -1939,19 +1938,29 @@ pfsync_defer(struct pf_state *st, struct mbuf *m)
m->m_flags & (M_BCAST|M_MCAST))
return (0);
 
+   pd = pool_get(&sc->sc_pool, M_NOWAIT);
+   if (pd == NULL)
+   return (0);
+
+   /*
+* the deferral queue grows faster than the timeout can consume
+* it, so we have to ask the packet (caller) to help the timer
+* and dispatch one deferral for us.
+*
+* We wish to call pfsync_undefer() here. Unfortunately we can't,
+* because pfsync_undefer() will be calling to ip_output(),
+* which in turn will call to pf_test(), which would then attempt
+

Re: [External] : Re: parallel forwarding vs. bridges

2021-06-03 Thread David Gwynne
On Thu, Jun 03, 2021 at 01:09:48AM +0200, Alexandr Nedvedicky wrote:
> >  pf_purge_expired_states(u_int32_t maxcheck)
> >  {
> 
> > -   cur = pf_state_ref(next);
> > +   do {
> > +   if ((cur->timeout == PFTM_UNLINKED) ||
> > +   (pf_state_expires(cur) <= getuptime())) {
> > +   SLIST_INSERT_HEAD(&gcl, pf_state_ref(cur), gc_list);
>   
> I wonder: is the extra reference you are chasing for coming from here?
> I suspect pf_state_ref(cur) is being called two times, as macro expands.

I think you're right :D :D

This is an updated diff with a fix for this and the explanation. It also
reduces the scope of the NET_LOCK so scanning the state list shouldn't
block the network stack.

Index: pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1118
diff -u -p -r1.1118 pf.c
--- pf.c  1 Jun 2021 09:57:11 -0000  1.1118
+++ pf.c  3 Jun 2021 06:24:48 -0000
@@ -308,7 +308,7 @@ static __inline void pf_set_protostate(s
 struct pf_src_tree tree_src_tracking;
 
 struct pf_state_tree_id tree_id;
-struct pf_state_queue state_list;
+struct pf_state_list pf_state_list = PF_STATE_LIST_INITIALIZER(pf_state_list);
 
 RB_GENERATE(pf_src_tree, pf_src_node, entry, pf_src_compare);
 RB_GENERATE(pf_state_tree, pf_state_key, entry, pf_state_compare_key);
@@ -440,6 +440,37 @@ pf_check_threshold(struct pf_threshold *
return (threshold->count > threshold->limit);
 }
 
+void
+pf_state_list_insert(struct pf_state_list *pfs, struct pf_state *st)
+{
+   /*
+* we can always put states on the end of the list.
+*
+* things reading the list should take a read lock, then
+* the mutex, get the head and tail pointers, release the
+* mutex, and then they can iterate between the head and tail.
+*/
+
+   pf_state_ref(st); /* get a ref for the list */
+
+   mtx_enter(&pfs->pfs_mtx);
+   TAILQ_INSERT_TAIL(&pfs->pfs_list, st, entry_list);
+   mtx_leave(&pfs->pfs_mtx);
+}
+
+void
+pf_state_list_remove(struct pf_state_list *pfs, struct pf_state *st)
+{
+   /* states can only be removed when the write lock is held */
+   rw_assert_wrlock(&pfs->pfs_rwl);
+
+   mtx_enter(&pfs->pfs_mtx);
+   TAILQ_REMOVE(&pfs->pfs_list, st, entry_list);
+   mtx_leave(&pfs->pfs_mtx);
+
+   pf_state_unref(st); /* list no longer references the state */
+}
+
 int
 pf_src_connlimit(struct pf_state **state)
 {
@@ -986,7 +1017,7 @@ pf_state_insert(struct pfi_kif *kif, str
PF_STATE_EXIT_WRITE();
return (-1);
}
-   TAILQ_INSERT_TAIL(&state_list, s, entry_list);
+   pf_state_list_insert(&pf_state_list, s);
pf_status.fcounters[FCNT_STATE_INSERT]++;
pf_status.states++;
pfi_kif_ref(kif, PFI_KIF_REF_STATE);
@@ -1247,7 +1278,8 @@ pf_purge_expired_rules(void)
 void
 pf_purge_timeout(void *unused)
 {
-   task_add(net_tq(0), &pf_purge_task);
+   /* XXX move to systqmp to avoid KERNEL_LOCK */
+   task_add(systq, &pf_purge_task);
 }
 
 void
@@ -1255,9 +1287,6 @@ pf_purge(void *xnloops)
 {
int *nloops = xnloops;
 
-   KERNEL_LOCK();
-   NET_LOCK();
-
/*
 * process a fraction of the state table every second
 * Note:
@@ -1268,6 +1297,8 @@ pf_purge(void *xnloops)
pf_purge_expired_states(1 + (pf_status.states
/ pf_default_rule.timeout[PFTM_INTERVAL]));
 
+   NET_LOCK();
+
PF_LOCK();
/* purge other expired types every PFTM_INTERVAL seconds */
if (++(*nloops) >= pf_default_rule.timeout[PFTM_INTERVAL]) {
@@ -1280,11 +1311,10 @@ pf_purge(void *xnloops)
 * Fragments don't require PF_LOCK(), they use their own lock.
 */
if ((*nloops) >= pf_default_rule.timeout[PFTM_INTERVAL]) {
-   pf_purge_expired_fragments();
+   pfgpurge_expired_fragment(s);
*nloops = 0;
}
NET_UNLOCK();
-   KERNEL_UNLOCK();
 
timeout_add_sec(&pf_purge_to, 1);
 }
@@ -1447,7 +1477,7 @@ pf_free_state(struct pf_state *cur)
}
pf_normalize_tcp_cleanup(cur);
pfi_kif_unref(cur->kif, PFI_KIF_REF_STATE);
-   TAILQ_REMOVE(&state_list, cur, entry_list);
+   pf_state_list_remove(&pf_state_list, cur);
if (cur->tag)
pf_tag_unref(cur->tag);
pf_state_unref(cur);
@@ -1458,53 +1488,77 @@ pf_free_state(struct pf_state *cur)
 void
 pf_purge_expired_states(u_int32_t maxcheck)
 {
+   /*
+* this task/thread/context/whatever is the only thing that
+* removes states from the pf_state_list, so the cur reference
+* it holds between calls is guaranteed to still be in the
+* list.
+*/
static struct pf_state  *cur = NULL;
-   struct pf_state *next;
-   SLIST_HEAD(pf_state_gcl, pf_state) gcl;
+
+   struct pf_state *head, *tail;
+   

Re: [External] : Re: parallel forwarding vs. bridges

2021-06-03 Thread David Gwynne
On Thu, Jun 03, 2021 at 01:09:48AM +0200, Alexandr Nedvedicky wrote:
> Hello,
> 
> looks like my earlier mail got eaten by SPAM filter, let's try again...
> 
> 
> > 
> > moving pf locks to mutexes makes sense to me, but like you say,
> > this will need testing and experimentation. one of the issues
> 
> I'm not entirely convinced that trading rwlocks for mutexes
> is a good thing. IMO the majority of packets are processed
> as readers on the state lock. If a packet does not match a state,
> then it needs to be pushed through the rules, where we run into the
> pf_lock and the state lock. Both those locks are grabbed exclusively on
> this path. However I think we can relax the pf_lock to a reader
> in the future.

there's obviously plusses and minuses for trading rwlocks for mutexes.

here's the states info from one of my firewalls at work:

State Table  Total Rate
  current entries   789418   
  half-open tcp   19101626   
  searches   2835612773525   221420.0/s
  inserts  37030235256 2891.5/s
  removals 37029445838 2891.5/s

the ratio of searches to state table updates is like 40 to 1, so yes,
letting readers through concurrently would be useful.

> I'd like to better understand why mutexes (spin locks) are better in the
> network subsystem. I'm concerned that removing all sleeping points from the
> network stack will make life harder for the scheduler. If such a box gets
> hammered by the network, the scheduler won't allow other components to
> run, because the whole network subsystem has turned into one giant
> non-interruptible block. I admit my concern may come from a lack
> of understanding of the overall concept.

as i said, there's plusses and minuses for mutexes as pflocks. the minus
is that you lose the ability to run pf concurrently. however, i'm pretty
sure that there's a lot of mutable state inside pf that gets updated on
every packet that almost certainly benefits from being run exclusively.
i'm mostly thinking about the packet and byte counters on rules and
states.
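
eg, the sort of per-packet mutable state i mean, sketched (the counter
fields are from struct pf_rule; treating the pf lock as a mutex here is
the hypothetical part):

	/* every matching packet mutates its rule's counters */
	mtx_enter(&pf_mtx);		/* short spin instead of a sleep */
	r->packets[dir == PF_OUT]++;
	r->bytes[dir == PF_OUT] += pd->tot_len;
	mtx_leave(&pf_mtx);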

my best argument for mutexes in pf is that currently we use smr critical
sections around things like the bridge and aggr port selection, which
are very obviously read-only workloads. pf may be a largely read-only
workload, but where it is at the moment means it's not going to get run
concurrently anyway.

i think the current scope of the locks in pf needs to be looked at
anyway. we might be able to reduce the contention and allow concurrency
with mutexes by doing something more complicated than just swapping the
lock type around. eg, could the ruleset be wired up with SMR pointers
for traversal? could we split the state trees into buckets and hash
packets into them for lookup? stoeplitz is probably cheap enough to make
that possible...
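
a rough sketch of the bucket idea (all names here are hypothetical;
stoeplitz would provide the flow hash):

	#define PF_STATE_BUCKETS 128		/* power of two */

	struct pf_state_bucket {
		struct mutex		 sb_mtx;   /* protects sb_tree */
		struct pf_state_tree	 sb_tree;  /* per-bucket key tree */
	};

	struct pf_state_bucket pf_state_buckets[PF_STATE_BUCKETS];

	static inline struct pf_state_bucket *
	pf_state_bucket(uint16_t flowhash)	/* eg, the stoeplitz hash */
	{
		return (&pf_state_buckets[flowhash % PF_STATE_BUCKETS]);
	}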

the scope of the pf locks likely needs reduction anyway. one of my
production firewalls panicked with the pf lock trying to lock against
itself a couple of nights ago:

db_enter() at db_enter+0x5
panic(81d47212) at panic+0x12a
rw_enter(82060e70,1) at rw_enter+0x261
pf_test(18,2,8158f000,800024d03db8) at pf_test+0x118c
ip6_output(fd80049a3a00,0,0,0,800024d03eb0,0) at
ip6_output+0xd33
nd6_ns_output(8158f000,0,800024d04228,fd8279b3b420,0) at
nd6_ns
_output+0x3e2
nd6_resolve(8158f000,fd816c6b2ae0,fd80659d8300,800024d04220
,800024d040d8) at nd6_resolve+0x29d
ether_resolve(8158f000,fd80659d8300,800024d04220,fd816c6b2a
e0,800024d040d8) at ether_resolve+0x127
ether_output(8158f000,fd80659d8300,800024d04220,fd816c6b2ae
0) at ether_output+0x2a
ip6_output(fd80659d8300,0,0,0,0,0) at ip6_output+0x1180
pfsync_undefer_notify(fd841dbec7b8) at pfsync_undefer_notify+0xac
pfsync_undefer(fd841dbec7b8,0) at pfsync_undefer+0x8d
pfsync_defer(fd82303cb310,fd8065a2) at pfsync_defer+0xfe
pf_test_rule(800024d04600,800024d045e8,800024d045e0,800024d045f
0,800024d045d8,800024d045fe) at pf_test_rule+0x693
pf_test(2,3,815a9000,800024d04798) at pf_test+0x10f1
ip_output(fd8065a2,0,800024d04850,1,0,0) at ip_output+0x829
ip_forward(fd8065a2,81541800,fd817a7722d0,0) at
ip_forward+
0x27a  
ip_input_if(800024d04a80,800024d04a7c,4,0,81541800) at
ip_input
_if+0x5fd
ipv4_input(81541800,fd8065a2) at ipv4_input+0x37
carp_input(8158f000,fd8065a2,5e000158) at
carp_input+0x1ac
ether_input(8158f000,fd8065a2) at ether_input+0x1c0
vlan_input(8152f000,fd8065a2) at vlan_input+0x19a
ether_input(8152f000,fd8065a2) at ether_input+0x76
if_input_process(801df048,800024d04ca8) at
if_input_process+0x5a
ifiq_process(801dbe00) at ifiq_process+0x6f

Re: [External] : Re: parallel forwarding vs. bridges

2021-06-02 Thread David Gwynne
On Wed, May 19, 2021 at 01:48:26AM +0200, Alexandr Nedvedicky wrote:
> Hello,
> 
> just for the record...
> 
> 
> 
> > > in current tree the ether_input() is protected by NET_LOCK(), which is 
> > > grabbed
> > > by caller as a writer. bluhm's diff changes NET_LOCK() readlock, so
> > > ether_input() can run concurrently. Switching NET_LOCK() to r-lock has
> > implications on smr read section above. The thing is the call to 
> > > eb->eb_input()
> > > can sleep now. This is something what needs to be avoided within smr 
> > > section.
> > 
> > Is the new sleeping point introduced by the fact the PF_LOCK() is a
> > rwlock?  Did you consider using a mutex, at least for the time being,
> > in order to not run in such issues?
> 
> below is a diff, which trades both pf(4) rw-locks for mutexes.
> 
> diff compiles, it still needs testing/experimenting.

hi. i'm trying to get my head back into this space, so i'm trying to
have a look at these diffs properly.

moving pf locks to mutexes makes sense to me, but like you say,
this will need testing and experimentation. one of the issues
identified in another part of this thread (and on icb?) is that the
ioctl path does some stuff which can sleep, but you can't sleep
while holding a mutex which would now be protecting the data
structures you're trying to look at.

a more specific example is if you're trying to dump the state table.
like you say, the state table is currently protected by the
pf_state_lock. iterating over the state table and giving up this lock to
copyout the entries is... not fun.

the diff below is a work in progress attempt to address this. it works
by locking the pf state list separately so it can support traversal
without (long) holds of locks needed in the forwarding path. the very
short explanation is that the TAILQ holding the states is locked
separately to the links between the states in the list. a much longer
explanataion is in the diff in the pfvar_priv.h chunk.

inserting states into the TAILQ while processing packets uses a mutex.
iterating over the list in the pf purge processing, ioctl paths, and
pfsync can largely rely on an rwlock. the rwlock would also allow
concurrent access to the list of states, ie, you can dump the list of
states while the gc thread is looking for states to collect, also while
pfsync is sending out a bulk update. it also converts the ioctl handling
for flushing states to using the list instead of one of the RB trees
holding states. i'm sure everyone wants an omgoptimised state flushing
mechanism.
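
a minimal sketch of the reader side, for illustration only (the
pf_state_queue head name inside struct pf_state_list is an assumption
here; pf_state_list and its pfs_rwl/pfs_mtx/pfs_list members come from
the diff below):

	struct pf_state *head, *tail, *st, *next;

	rw_enter_read(&pf_state_list.pfs_rwl);

	/* snapshot the head and tail pointers under the mutex */
	mtx_enter(&pf_state_list.pfs_mtx);
	head = TAILQ_FIRST(&pf_state_list.pfs_list);
	tail = TAILQ_LAST(&pf_state_list.pfs_list, pf_state_queue);
	mtx_leave(&pf_state_list.pfs_mtx);

	/*
	 * the read lock keeps the links between head and tail stable,
	 * so the walk may sleep (eg, in copyout()) without blocking
	 * inserts at the tail of the list.
	 */
	st = NULL;
	next = head;
	while (st != tail && next != NULL) {
		st = next;
		next = TAILQ_NEXT(st, entry_list);

		/* export or copyout st here */
	}

	rw_exit_read(&pf_state_list.pfs_rwl);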

this is a work in progress because i've screwed up the pf_state
reference counting somehow. states end up with an extra reference, and i
can't see where that's coming from.

anyway. apart from the refcounting thing it seems to be working well,
and would take us a step closer to using mutexes for pf locks.

even if we throw this away, i'd still like to move the pf purge
processing from out of nettq0 into systq, just to avoid having a place
where a nettq thread has to spin waiting for the kernel lock.

Index: pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1118
diff -u -p -r1.1118 pf.c
--- pf.c	1 Jun 2021 09:57:11 -0000	1.1118
+++ pf.c	2 Jun 2021 07:52:21 -0000
@@ -308,7 +308,7 @@ static __inline void pf_set_protostate(s
 struct pf_src_tree tree_src_tracking;
 
 struct pf_state_tree_id tree_id;
-struct pf_state_queue state_list;
+struct pf_state_list pf_state_list = PF_STATE_LIST_INITIALIZER(pf_state_list);
 
 RB_GENERATE(pf_src_tree, pf_src_node, entry, pf_src_compare);
 RB_GENERATE(pf_state_tree, pf_state_key, entry, pf_state_compare_key);
@@ -440,6 +440,37 @@ pf_check_threshold(struct pf_threshold *
return (threshold->count > threshold->limit);
 }
 
+void
+pf_state_list_insert(struct pf_state_list *pfs, struct pf_state *st)
+{
+   /*
+* we can always put states on the end of the list.
+*
+* things reading the list should take a read lock, then
+* the mutex, get the head and tail pointers, release the
+* mutex, and then they can iterate between the head and tail.
+*/
+
+   pf_state_ref(st); /* get a ref for the list */
+
+   mtx_enter(&pfs->pfs_mtx);
+   TAILQ_INSERT_TAIL(&pfs->pfs_list, st, entry_list);
+   mtx_leave(&pfs->pfs_mtx);
+}
+
+void
+pf_state_list_remove(struct pf_state_list *pfs, struct pf_state *st)
+{
+   /* states can only be removed when the write lock is held */
+   rw_assert_wrlock(&pfs->pfs_rwl);
+
+   mtx_enter(&pfs->pfs_mtx);
+   TAILQ_REMOVE(&pfs->pfs_list, st, entry_list);
+   mtx_leave(&pfs->pfs_mtx);
+
+   pf_state_unref(st); /* list no longer references the state */
+}
+
 int
 pf_src_connlimit(struct pf_state **state)
 {
@@ -986,7 +1017,7 @@ pf_state_insert(struct pfi_kif *kif, str
PF_STATE_EXIT_WRITE();
return (-1);
}
-   TAILQ_INSERT_TAIL(&state_list, s, entry_list);
+   pf_state_list_insert(&pf_state_list, s);

Re: [External] : Re: parallel forwarding vs. bridges

2021-05-18 Thread Alexandr Nedvedicky
Hello,

just for the record...



> > in the current tree ether_input() is protected by NET_LOCK(), which is
> > grabbed by the caller as a writer. bluhm's diff changes NET_LOCK() to a
> > read lock, so ether_input() can run concurrently. Switching NET_LOCK()
> > to an r-lock has implications for the smr read section above. The thing
> > is the call to eb->eb_input() can sleep now. This is something that
> > needs to be avoided within an smr section.
> 
> Is the new sleeping point introduced by the fact the PF_LOCK() is a
> rwlock?  Did you consider using a mutex, at least for the time being,
> in order to not run into such issues?

below is a diff, which trades both pf(4) rw-locks for mutexes.

diff compiles, it still needs testing/experimenting.

regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf_ioctl.c b/sys/net/pf_ioctl.c
index ae7bb008351..30c78da7d16 100644
--- a/sys/net/pf_ioctl.c
+++ b/sys/net/pf_ioctl.c
@@ -146,8 +146,8 @@ TAILQ_HEAD(pf_tags, pf_tagname) pf_tags = TAILQ_HEAD_INITIALIZER(pf_tags),
  * grab the lock as writer. Whenever packet creates state it grabs pf_lock
  * first then it locks pf_state_lock as the writer.
  */
-struct rwlock   pf_lock = RWLOCK_INITIALIZER("pf_lock");
-struct rwlock   pf_state_lock = RWLOCK_INITIALIZER("pf_state_lock");
+struct mutex	pf_lock = MUTEX_INITIALIZER(IPL_NET);
+struct mutex	pf_state_lock = MUTEX_INITIALIZER(IPL_NET);
 
 #if (PF_QNAME_SIZE != PF_TAG_NAME_SIZE)
 #error PF_QNAME_SIZE must be equal to PF_TAG_NAME_SIZE
diff --git a/sys/net/pfvar_priv.h b/sys/net/pfvar_priv.h
index b1a122af409..c5ae392b468 100644
--- a/sys/net/pfvar_priv.h
+++ b/sys/net/pfvar_priv.h
@@ -39,7 +39,7 @@
 
#include <sys/rwlock.h>
 
-extern struct rwlock pf_lock;
+extern struct mutex pf_lock;
 
 struct pf_pdesc {
struct {
@@ -104,52 +104,41 @@ extern struct timeout pf_purge_to;
struct pf_state	*pf_state_ref(struct pf_state *);
void		 pf_state_unref(struct pf_state *);
 
-extern struct rwlock   pf_lock;
-extern struct rwlock   pf_state_lock;
+extern struct mutex	pf_lock;
+extern struct mutex	pf_state_lock;
 
#define PF_LOCK()  do {\
	NET_ASSERT_LOCKED();\
-   rw_enter_write(&pf_lock);   \
+   mtx_enter(&pf_lock);\
	} while (0)

#define PF_UNLOCK()	do {\
	PF_ASSERT_LOCKED(); \
-   rw_exit_write(&pf_lock);\
+   mtx_leave(&pf_lock);\
	} while (0)

-#define PF_ASSERT_LOCKED() do {\
-   if (rw_status(&pf_lock) != RW_WRITE)\
-   splassert_fail(RW_WRITE,\
-   rw_status(&pf_lock),__func__);\
-   } while (0)
+#define PF_ASSERT_LOCKED() (void)(0)

-#define PF_ASSERT_UNLOCKED()   do {\
-   if (rw_status(&pf_lock) == RW_WRITE)\
-   splassert_fail(0, rw_status(&pf_lock), __func__);\
-   } while (0)
+#define PF_ASSERT_UNLOCKED()   (void)(0)

#define PF_STATE_ENTER_READ()  do {\
-   rw_enter_read(&pf_state_lock);  \
+   mtx_enter(&pf_state_lock);  \
	} while (0)

#define PF_STATE_EXIT_READ()   do {\
-   rw_exit_read(&pf_state_lock);   \
+   mtx_leave(&pf_state_lock);  \
	} while (0)

#define PF_STATE_ENTER_WRITE() do {\
-   rw_enter_write(&pf_state_lock); \
+   mtx_enter(&pf_state_lock);  \
	} while (0)

#define PF_STATE_EXIT_WRITE()  do {\
	PF_ASSERT_STATE_LOCKED();   \
-   rw_exit_write(&pf_state_lock);  \
+   mtx_leave(&pf_state_lock);  \
	} while (0)

-#define PF_ASSERT_STATE_LOCKED()   do {\
-   if (rw_status(&pf_state_lock) != RW_WRITE)\
-   splassert_fail(RW_WRITE,\
-   rw_status(&pf_state_lock), __func__);\
-   } while (0)
+#define PF_ASSERT_STATE_LOCKED()   (void)(0)
 
 extern void pf_purge_timeout(void *);
 extern void pf_purge(void *);
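
note the locking assertions become no-ops in this diff. if we want to
keep the checks, mutexes have their own assertion macros, so a sketch
like this (not part of the diff) should work:

	#define PF_ASSERT_LOCKED()		MUTEX_ASSERT_LOCKED(&pf_lock)
	#define PF_ASSERT_UNLOCKED()		MUTEX_ASSERT_UNLOCKED(&pf_lock)
	#define PF_ASSERT_STATE_LOCKED()	MUTEX_ASSERT_LOCKED(&pf_state_lock)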



Re: [External] : Re: parallel forwarding vs. bridges

2021-05-17 Thread Alexandr Nedvedicky
Hello,

On Mon, May 17, 2021 at 08:19:37PM +0200, Alexander Bluhm wrote:
> On Mon, May 17, 2021 at 04:24:15PM +0200, Alexandr Nedvedicky wrote:
> > in the current tree ether_input() is protected by NET_LOCK(), which is
> > grabbed by the caller as a writer. bluhm's diff changes NET_LOCK() to a
> > read lock, so ether_input() can run concurrently. Switching NET_LOCK()
> > to an r-lock has implications for the smr read section above. The thing
> > is the call to eb->eb_input() can sleep now. This is something that
> > needs to be avoided within an smr section.
> 
> Do you think the sleeping occurs in PF_LOCK()?  Could it be that
> this rwlock never slept before, but was acquired instantly?
> 

yes, the sleep occurs on PF_LOCK(), right here in pf_test():
7052			a = s->anchor.ptr;
7053	#if NPFLOG > 0
7054			pd.pflog |= s->log;
7055	#endif	/* NPFLOG > 0 */
7056		} else if (s == NULL) {
7057			PF_LOCK();
7058			have_pf_lock = 1;
7059			action = pf_test_rule(&pd, &r, &s, &a, &ruleset,
7060			    &reason);
7061			s = pf_state_ref(s);
(gdb)

with an exclusive NET_LOCK() the PF_LOCK() never blocks/sleeps, because
the exclusive NET_LOCK() ensures there is only a single packet running
in pf_test().  changing the exclusive NET_LOCK() to a reader lock allows
more packets to run in pf_test(), therefore PF_LOCK() can block.


> Having sleeps in the packet forwarding path is a bad idea.  And
> refcounting per packet looks like a workaround.  It does not make
> things faster.

I'm afraid sleeps (exclusive locks) cannot be avoided with the current
data structures we have in PF. for example, if a packet creates state,
it must grab the state lock exclusively.

we have to make sure no sleep occurs in code protected by a lock. The
risk comes from a blocking malloc called by a caller that owns the lock
(either as reader or writer).

> 
> How about implementing PF_LOCK() with a mutex?

As I've answered mpi@ already, there are two locks we need to change
to mutexes:
PF_LOCK() and the state lock
> 
> I made such a diff a few months ago.  It is not well tested yet.
> There were some problems when allocating pf mutex in ioctl path.
> mpi@ did not like it, as it removes the pf lock in pf ioctl and
> relies on net lock for reconfigure.

I've read through your diff. If I understand it right, two things
are being changed:
PF_LOCK() is thrown away completely and we use NET_LOCK() instead

state_lock is changed to a mutex

IMO removing PF_LOCK() is a step backward. I think every subsystem (including
PF) should provide its own lock and should not piggyback on NET_LOCK().
> 
> It is a design question.  Do we want to keep net lock for reconfigure?

No, we don't want NET_LOCK() for reconfigure. configuration changes
should be synchronized using PF_LOCK()

> If we process packets with shared netlock and reconfigure pf with
> exclusive netlock this looks reasonable.
> 
> Do we want to go this way?  Then I will polish and test this diff.

no I think we don't want to go that way.

when it comes to pf(4), configuration is more colorful.  the ioctl which
updates the firewall configuration grabs the lock exclusively (be it the
current pf_lock or net_lock). For certain operations (queue management
and tables) we also allocate memory _without_ NOSLEEP. We may end up in
a situation where we are sleeping in malloc while holding a lock we
share with packets.

Therefore I think we actually need an extra rw-lock just for ioctl.
this new lock will serialize all callers coming from userland.
we don't mind sleeping on malloc with the ioctl rw-lock held exclusively.

as soon as the memory for the desired operation has been acquired, it
will be time to grab PF_LOCK() as a writer and update the firewall
config. this way we should avoid blocking packet processing on memory
allocations.
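
a rough sketch of that ordering (pf_ioctl_rwl and pf_example_add are
made-up names; the point is where the sleeping allocation sits relative
to the two locks):

	struct rwlock	pf_ioctl_rwl = RWLOCK_INITIALIZER("pfioctl");

	int
	pf_example_add(void)
	{
		struct pfr_ktable *kt;

		/* serialize userland callers; packets never take this lock */
		rw_enter_write(&pf_ioctl_rwl);

		/* the sleeping allocation is safe here */
		kt = malloc(sizeof(*kt), M_TEMP, M_WAITOK | M_ZERO);

		PF_LOCK();
		/* short, non-sleeping critical section: link kt into config */
		PF_UNLOCK();

		rw_exit_write(&pf_ioctl_rwl);

		return (0);
	}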

I think Vitaliy implemented something similar for another subsystem,
can't remember which one (?pipex?).

thanks and
regards
sashan



Re: [External] : Re: parallel forwarding vs. bridges

2021-05-17 Thread Alexandr Nedvedicky
On Mon, May 17, 2021 at 08:27:12PM +0200, Martin Pieuchot wrote:
> On 17/05/21(Mon) 19:52, Alexandr Nedvedicky wrote:
> > [...] 
> > I don't mind trading pf_lock and pf_state_lock for mutexes, however
> > I see such a step as kind of a 'NO-OP'. We do have a sufficient measure
> > currently, which is: keep NET_LOCK() as is. Maybe I'm not seeing
> > your idea/plan behind changing pf's rw-locks to mutexes. If you feel
> > there is a benefit to going that way, then let's do it, but I'd like
> > to understand where we will be going/what to expect.
> 
> I've no idea or plan.  I'm just pointing out that using rwlocks, for the
> moment, adds extra work.  If it's easy to use mutexes then you might want
> to start with that.  The whole network processing path assumes it runs
> without sleeping.  I've no idea what can happen when this assumption
> is broken.
> 

I see. I keep forgetting about locally bound traffic, which is
far more complex than the forwarding scenario.

> I'm well aware that using a single big pf lock is not the best for
> performance, but maybe it's easier to do baby steps.
> 
> That said, I don't want to stop or discourage anyone.  If you're
> confident enough that rwlocks are the way to go, then please, go
> ahead.
> 

I think we can keep mutexes as a kind of fallback plan if we decide
to commit bluhm's diff.

thanks and
regards
sashan



Re: [External] : Re: parallel forwarding vs. bridges

2021-05-17 Thread Martin Pieuchot
On 17/05/21(Mon) 19:52, Alexandr Nedvedicky wrote:
> [...] 
> I don't mind trading pf_lock and pf_state_lock for mutexes, however
> I see such a step as kind of a 'NO-OP'. We do have a sufficient measure
> currently, which is: keep NET_LOCK() as is. Maybe I'm not seeing
> your idea/plan behind changing pf's rw-locks to mutexes. If you feel
> there is a benefit to going that way, then let's do it, but I'd like
> to understand where we will be going/what to expect.

I've no idea or plan.  I'm just pointing out that using rwlocks, for the
moment, adds extra work.  If it's easy to use mutexes then you might want
to start with that.  The whole network processing path assumes it runs
without sleeping.  I've no idea what can happen when this assumption
is broken.

I'm well aware that using a single big pf lock is not the best for
performance, but maybe it's easier to do baby steps.

That said, I don't want to stop or discourage anyone.  If you're
confident enough that rwlocks are the way to go, then please, go
ahead.

Cheers,
Martin



Re: parallel forwarding vs. bridges

2021-05-17 Thread Alexander Bluhm
On Mon, May 17, 2021 at 04:24:15PM +0200, Alexandr Nedvedicky wrote:
> in the current tree ether_input() is protected by NET_LOCK(), which is grabbed
> by the caller as a writer. bluhm's diff changes NET_LOCK() to a read lock, so
> ether_input() can run concurrently. Switching NET_LOCK() to an r-lock has
> implications for the smr read section above. The thing is the call to eb->eb_input()
> can sleep now. This is something that needs to be avoided within an smr section.

Do you think the sleeping occurs in PF_LOCK()?  Could it be that
this rwlock never slept before, but was acquired instantly?

Having sleeps in the packet forwarding path is a bad idea.  And
refcounting per packet looks like a workaround.  It does not make
things faster.

How about implementing PF_LOCK() with a mutex?

I made such a diff a few months ago.  It is not well tested yet.
There were some problems when allocating pf mutex in ioctl path.
mpi@ did not like it, as it removes the pf lock in pf ioctl and
relies on net lock for reconfigure.

It is a design question.  Do we want to keep net lock for reconfigure?
If we process packets with shared netlock and reconfigure pf with
exclusive netlock this looks reasonable.

Do we want to go this way?  Then I will polish and test this diff.

bluhm

Index: net/pf.c
===
RCS file: /data/mirror/openbsd/cvs/src/sys/net/pf.c,v
retrieving revision 1.1116
diff -u -p -r1.1116 pf.c
--- net/pf.c	27 Apr 2021 09:38:29 -0000	1.1116
+++ net/pf.c	17 May 2021 17:38:20 -0000
@@ -6819,6 +6819,8 @@ pf_test(sa_family_t af, int fwdir, struc
if (!pf_status.running)
return (PF_PASS);
 
+   NET_ASSERT_LOCKED();
+
 #if NCARP > 0
if (ifp->if_type == IFT_CARP &&
(ifp0 = if_get(ifp->if_carpdevidx)) != NULL) {
Index: net/pf_ioctl.c
===
RCS file: /data/mirror/openbsd/cvs/src/sys/net/pf_ioctl.c,v
retrieving revision 1.363
diff -u -p -r1.363 pf_ioctl.c
--- net/pf_ioctl.c	9 Feb 2021 23:37:54 -0000	1.363
+++ net/pf_ioctl.c	17 May 2021 17:38:20 -0000
@@ -146,8 +146,8 @@ TAILQ_HEAD(pf_tags, pf_tagname) pf_tags 
  * grab the lock as writer. Whenever packet creates state it grabs pf_lock
  * first then it locks pf_state_lock as the writer.
  */
-struct rwlock   pf_lock = RWLOCK_INITIALIZER("pf_lock");
-struct rwlock   pf_state_lock = RWLOCK_INITIALIZER("pf_state_lock");
+struct mutex   pf_lock = MUTEX_INITIALIZER(IPL_SOFTNET);
+struct mutex   pf_state_lock = MUTEX_INITIALIZER(IPL_SOFTNET);
 
 #if (PF_QNAME_SIZE != PF_TAG_NAME_SIZE)
 #error PF_QNAME_SIZE must be equal to PF_TAG_NAME_SIZE
@@ -1009,7 +1009,6 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
 
case DIOCSTART:
NET_LOCK();
-   PF_LOCK();
if (pf_status.running)
error = EEXIST;
else {
@@ -1023,13 +1022,11 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
pf_create_queues();
DPFPRINTF(LOG_NOTICE, "pf: started");
}
-   PF_UNLOCK();
NET_UNLOCK();
break;
 
case DIOCSTOP:
NET_LOCK();
-   PF_LOCK();
if (!pf_status.running)
error = ENOENT;
else {
@@ -1038,7 +1035,6 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
pf_remove_queues();
DPFPRINTF(LOG_NOTICE, "pf: stopped");
}
-   PF_UNLOCK();
NET_UNLOCK();
break;
 
@@ -1048,7 +1044,6 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
u_int32_t	nr = 0;
 
NET_LOCK();
-   PF_LOCK();
pq->ticket = pf_main_ruleset.rules.active.ticket;
 
/* save state to not run over them all each time? */
@@ -1058,7 +1053,6 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
nr++;
}
pq->nr = nr;
-   PF_UNLOCK();
NET_UNLOCK();
break;
}
@@ -1069,10 +1063,8 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
u_int32_t	nr = 0;
 
NET_LOCK();
-   PF_LOCK();
if (pq->ticket != pf_main_ruleset.rules.active.ticket) {
error = EBUSY;
-   PF_UNLOCK();
NET_UNLOCK();
break;
}
@@ -1083,12 +1075,10 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
qs = TAILQ_NEXT(qs, entries);
if (qs == NULL) {
error = EBUSY;
-   PF_UNLOCK();
NET_UNLOCK();
break;
}
memcpy(&pq->queue, qs, sizeof(pq->queue));

Re: [External] : Re: parallel forwarding vs. bridges

2021-05-17 Thread Alexandr Nedvedicky
Hello,

On Mon, May 17, 2021 at 06:59:36PM +0200, Martin Pieuchot wrote:
> On 17/05/21(Mon) 16:24, Alexandr Nedvedicky wrote:
> > Hrvoje,
> > 
> > managed to trigger diagnostic panics with diff [1] sent by bluhm@
> > some time ago. The panic Hrvoje sees comes from ether_input() here:
> > 
> >  414 
> >  415 /*
> >  416  * Third phase: bridge processing.
> >  417  *
> >  418  * Give the packet to a bridge interface, ie, bridge(4),
> >  419  * switch(4), or tpmr(4), if it is configured. A bridge
> >  420  * may take the packet and forward it to another port, or it
> >  421  * may return it here to ether_input() to support local
> >  422  * delivery to this port.
> >  423  */
> >  424 
> >  425 ac = (struct arpcom *)ifp;
> >  426 
> >  427 smr_read_enter();
> >  428 eb = SMR_PTR_GET(&ac->ac_brport);
> >  429 if (eb != NULL) {
> >  430 m = (*eb->eb_input)(ifp, m, dst, eb->eb_port);
> >  431 if (m == NULL) {
> >  432 smr_read_leave();
> >  433 return;
> >  434 }
> >  435 }
> >  436 smr_read_leave();
> >  437 
> > 
> > in the current tree ether_input() is protected by NET_LOCK(), which is
> > grabbed by the caller as a writer. bluhm's diff changes NET_LOCK() to a
> > read lock, so ether_input() can run concurrently. Switching NET_LOCK()
> > to an r-lock has implications for the smr read section above. The thing
> > is the call to eb->eb_input() can sleep now. This is something that
> > needs to be avoided within an smr section.
> 
> Is the new sleeping point introduced by the fact the PF_LOCK() is a
> rwlock?  Did you consider using a mutex, at least for the time being,
> in order to not run into such issues?

there are two locks in pf_test():
PF_LOCK()

PF_STATE_LOCK()

we would have to convert both locks into mutexes.

trading PF_STATE_LOCK() into mutex is step backward IMO.
bluhm's parallel diff enables PF to run in parallel for packets.
if a packet matches an existing state, it just grabs the state lock as
a reader. This brings a huge benefit, because the majority of packets
just do the state lookup. Only some packets need to go the expensive
way through rules and state creation, which involves PF_LOCK() and a
write lock on the state rw-lock.

Most packets take those locks as readers and are done with PF.
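
a sketch of that fast path (example_lookup_byid is a made-up wrapper;
tree_id, pf_state_ref() and pf_state_unref() are the in-tree names):

	struct pf_state *
	example_lookup_byid(u_int64_t id, u_int32_t creatorid)
	{
		struct pf_state key, *s;

		key.id = id;
		key.creatorid = creatorid;

		PF_STATE_ENTER_READ();
		s = RB_FIND(pf_state_tree_id, &tree_id, &key);
		s = pf_state_ref(s);	/* take a ref before the lock drops */
		PF_STATE_EXIT_READ();

		/* the caller drops the ref with pf_state_unref() */
		return (s);
	}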

trading both locks for mutexes will do the trick too. However I'm not
sure I see a benefit there (compared to the current NET_LOCK() we have
in tree).

I think all the safety belts we have kicked in. We learned an smr read
section is not an option for ether_input() when it comes to bridges.
perhaps an alternative solution would be to introduce an rw-lock, which
would protect 'ac->br_port'. Such a read lock would cover the call to
->eb_input(). However I'm not sure about it.
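
a sketch of that alternative (ac_brlock is a made-up member of struct
arpcom; the detach path would have to take it as a writer before
clearing the port):

	rw_enter_read(&ac->ac_brlock);
	eb = ac->ac_brport;
	if (eb != NULL) {
		/* sleeping in eb_input() is fine under a read lock */
		m = (*eb->eb_input)(ifp, m, dst, eb->eb_port);
	}
	rw_exit_read(&ac->ac_brlock);
	if (eb != NULL && m == NULL)
		return;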

I don't mind trading pf_lock and pf_state_lock for mutexes, however
I see such a step as kind of a 'NO-OP'. We do have a sufficient measure
currently, which is: keep NET_LOCK() as is. Maybe I'm not seeing
your idea/plan behind changing pf's rw-locks to mutexes. If you feel
there is a benefit to going that way, then let's do it, but I'd like
to understand where we will be going/what to expect.

thanks and
regards
sashan



Re: parallel forwarding vs. bridges

2021-05-17 Thread Hrvoje Popovski
On 17.5.2021. 16:24, Alexandr Nedvedicky wrote:
> Hrvoje,
> 
> managed to trigger diagnostic panics with diff [1] sent by bluhm@
> some time ago. The panic Hrvoje sees comes from ether_input() here:

> [1] https://marc.info/?l=openbsd-tech=161903387904923=2
> 
> [2] https://marc.info/?l=openbsd-tech=162099270106067=2

for those interested, I sent emails to a couple of developers with the
following panics

c/p

Hi guys,

i'm testing bluhm@ parallel diff with pf and tpmr/veb/bridge.
if i'm sending traffic over tpmr or veb, the same second i enable pf,
the box panics.. see attachment ..

with bridge i can't panic box, but it seems that performance is even
worse than without parallel diff ... i will play with it more, maybe
it's just some glitch ...
pf and tpmr

r620-1# panic: kernel diagnostic assertion "curcpu()->ci_schedstate.spc_smrdepth == 0" failed: file "/sys/kern/subr_xxx.c", line 163
Stopped at      db_enter+0x10:  popq    %rbp
    TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
  11538  95266  0 0x14000  0x2002  softnet
 239565   4172  0 0x14000  0x2001  softnet
db_enter() at db_enter+0x10
panic(81e0198a) at panic+0x12a
__assert(81e6f4ee,81e93713,a3,81ea877c) at __assert+0x2b
assertwaitok() at assertwaitok+0xdc
mi_switch() at mi_switch+0x40
sleep_finish(800023877740,1) at sleep_finish+0x11c
rw_enter(8212e900,1) at rw_enter+0x229
pf_test(2,1,80082048,800023877978) at pf_test+0x1055
tpmr_pf(80082048,1,fd80cccb8c00) at tpmr_pf+0x7e
tpmr_input(80082048,fd80cccb8c00,90e2ba1d4f89,80680f00) at tpmr_input+0x124
ether_input(80082048,fd80cccb8c00) at ether_input+0xf5
if_input_process(80082048,800023877af8) at if_input_process+0x6f
ifiq_process(80086000) at ifiq_process+0x69
taskq_thread(80030300) at taskq_thread+0x9f
end trace frame: 0x0, count: 1
https://www.openbsd.org/ddb.html describes the minimum info required in bug
reports.  Insufficient info makes it difficult to find and fix bugs.
ddb{0}>


ddb{0}> show panic
kernel diagnostic assertion "curcpu()->ci_schedstate.spc_smrdepth == 0" failed: file "/sys/kern/subr_xxx.c", line 163


ddb{0}> show locks
shared rwlock netlock r = 0 (0x8212eab0)
#0  witness_lock+0x339
#1  if_input_process+0x43
#2  ifiq_process+0x69
#3  taskq_thread+0x9f
#4  proc_trampoline+0x1c
shared rwlock softnet r = 0 (0x80030370)
#0  witness_lock+0x339
#1  taskq_thread+0x92
#2  proc_trampoline+0x1c
exclusive sched_lock _lock r = 0 (0x8230a100)
#0  witness_lock+0x339
#1  sleep_setup+0xa5
#2  rw_enter+0x208
#3  pf_test+0x1055
#4  tpmr_pf+0x7e
#5  tpmr_input+0x124
#6  ether_input+0xf5
#7  if_input_process+0x6f
#8  ifiq_process+0x69
#9  taskq_thread+0x9f
#10 proc_trampoline+0x1c


ddb{0}> show reg
rdi   0x820f1c80    kprintf_mutex
rsi  0x5
rbp   0x8000238775a0
rbx   0x8000238775b0
rdx   0x8000
rcx   0x206
rax  0x1
r8    0x800023877560
r9 0
r10   0
r11   0x80ef1bc6ac87b49d
r12 0x38
r13   0x800023877650
r14   0x100
r15   0x81e0198a    cmd0646_9_tim_udma+0x24a93
rip   0x81106070    db_enter+0x10
cs   0x8
rflags 0x282
rsp   0x8000238775a0
ss  0x10
db_enter+0x10:  popq    %rbp



ddb{0}> show mbuf
mbuf 0x81106070
m_type: -13108  m_flags: c3cc
m_next: 0x1d3b4c241c334c5d  m_nextpkt: 0x117400fdffac
m_data: 0x  m_len: 3435973836
m_dat: 0x81106090   m_pktdat: 0x811060e0


ddb{0}> ps
   PID     TID   PPID    UID  S       FLAGS  WAIT          COMMAND
 80810  268029  1  0  30x100083  ttyin ksh
 90433  180245  1  0  30x100098  poll  cron
 921768747  42701 95  30x100092  kqreadsmtpd
 83989   36449  42701103  30x100092  kqreadsmtpd
 82331  325465  42701 95  30x100092  kqreadsmtpd
 28903  422168  42701 95  30x100092  kqreadsmtpd
  5781   10814  42701 95  30x100092  kqreadsmtpd
 70395   59130  42701 95  30x100092  kqreadsmtpd
 42701  329509  1  0  30x100080  kqreadsmtpd
 47357  342015  1  0  30x80  selectsshd
 91828  134960  1  0  30x100080  poll  ntpd
 42264  230567  72970 83  30x100092  poll  ntpd
 72970  150962  1 83  30x100092  poll  ntpd
 23464  496327  92870 73  30x100090  kqreadsyslogd
 92870  181949  1  0  30x100082  netio syslogd
 15150   95273

Re: parallel forwarding vs. bridges

2021-05-17 Thread Martin Pieuchot
On 17/05/21(Mon) 16:24, Alexandr Nedvedicky wrote:
> Hrvoje,
> 
> managed to trigger diagnostic panics with diff [1] sent by bluhm@
> some time ago. The panic Hrvoje sees comes from ether_input() here:
> 
>  414 
>  415 /*
>  416  * Third phase: bridge processing.
>  417  *
>  418  * Give the packet to a bridge interface, ie, bridge(4),
>  419  * switch(4), or tpmr(4), if it is configured. A bridge
>  420  * may take the packet and forward it to another port, or it
>  421  * may return it here to ether_input() to support local
>  422  * delivery to this port.
>  423  */
>  424 
>  425 ac = (struct arpcom *)ifp;
>  426 
>  427 smr_read_enter();
>  428 eb = SMR_PTR_GET(&ac->ac_brport);
>  429 if (eb != NULL) {
>  430 m = (*eb->eb_input)(ifp, m, dst, eb->eb_port);
>  431 if (m == NULL) {
>  432 smr_read_leave();
>  433 return;
>  434 }
>  435 }
>  436 smr_read_leave();
>  437 
> 
> in the current tree ether_input() is protected by NET_LOCK(), which is grabbed
> by the caller as a writer. bluhm's diff changes NET_LOCK() to a read lock, so
> ether_input() can run concurrently. Switching NET_LOCK() to an r-lock has
> implications for the smr read section above. The thing is the call to eb->eb_input()
> can sleep now. This is something that needs to be avoided within an smr section.

Is the new sleeping point introduced by the fact the PF_LOCK() is a
rwlock?  Did you consider using a mutex, at least for the time being,
in order to not run into such issues?



parallel forwarding vs. bridges

2021-05-17 Thread Alexandr Nedvedicky
Hrvoje,

managed to trigger diagnostic panics with diff [1] sent by bluhm@
some time ago. The panic Hrvoje sees comes from ether_input() here:

 414 
 415 /*
 416  * Third phase: bridge processing.
 417  *
 418  * Give the packet to a bridge interface, ie, bridge(4),
 419  * switch(4), or tpmr(4), if it is configured. A bridge
 420  * may take the packet and forward it to another port, or it
 421  * may return it here to ether_input() to support local
 422  * delivery to this port.
 423  */
 424 
 425 ac = (struct arpcom *)ifp;
 426 
 427 smr_read_enter();
 428 eb = SMR_PTR_GET(&ac->ac_brport);
 429 if (eb != NULL) {
 430 m = (*eb->eb_input)(ifp, m, dst, eb->eb_port);
 431 if (m == NULL) {
 432 smr_read_leave();
 433 return;
 434 }
 435 }
 436 smr_read_leave();
 437 

in the current tree ether_input() is protected by NET_LOCK(), which is grabbed
by the caller as a writer. bluhm's diff changes NET_LOCK() to a read lock, so
ether_input() can run concurrently. Switching NET_LOCK() to an r-lock has
implications for the smr read section above. The thing is the call to eb->eb_input()
can sleep now. This is something that needs to be avoided within an smr section.

the diff below introduces reference counting on the bridge port using atomic
ops. the section above is changed to:

 415 /*
 416  * Third phase: bridge processing.
 417  *
 418  * Give the packet to a bridge interface, ie, bridge(4),
 419  * switch(4), or tpmr(4), if it is configured. A bridge
 420  * may take the packet and forward it to another port, or it
 421  * may return it here to ether_input() to support local
 422  * delivery to this port.
 423  */
 424 
 425 ac = (struct arpcom *)ifp;
 426 
 427 smr_read_enter();
 428 eb = SMR_PTR_GET(&ac->ac_brport);
 429 if (eb != NULL)
 430 eb->eb_port_take(eb->eb_port);
 431 smr_read_leave();
 432 if (eb != NULL) {
 433 m = (*eb->eb_input)(ifp, m, dst, eb->eb_port);
 434 eb->eb_port_rele(eb->eb_port);
 435 if (m == NULL) {
 436 return;
 437 }
 438 }
 439 

each bridge module must provide its own implementation of
eb_port_take()/eb_port_rele(). Not sure if it is the way to go. According to
testing done by Hrvoje the diagnostic panic disappears, so we can say the
diff solves the problem, but there might be better ways to solve it.
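
the bridge(4) hooks in the diff below are no-ops. a module whose port
lifetime really is tied to the refcount could wrap refcnt(9) instead;
a sketch with a made-up example_port type:

	struct example_port {
		struct refcnt	 ep_refs;
		/* port state goes here */
	};

	void
	example_port_take(void *port)
	{
		struct example_port *ep = port;

		refcnt_take(&ep->ep_refs);
	}

	void
	example_port_rele(void *port)
	{
		struct example_port *ep = port;

		refcnt_rele_wake(&ep->ep_refs);
	}

the detach side would then clear the ac_brport pointer with
SMR_PTR_SET_LOCKED() and refcnt_finalize() the port before freeing it.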

there are other bugs to kill as pointed out here [2].

thanks and
regards
sashan

[1] https://marc.info/?l=openbsd-tech=161903387904923=2

[2] https://marc.info/?l=openbsd-tech=162099270106067=2

8<---8<---8<--8<

diff --git a/sys/net/if_bridge.c b/sys/net/if_bridge.c
index 7dbcf4a2ab0..2afcbb296a0 100644
--- a/sys/net/if_bridge.c
+++ b/sys/net/if_bridge.c
int	bridge_ipsec(struct ifnet *, struct ether_header *, int, struct llc *,
	    int, int, int, struct mbuf *);
 #endif
 int bridge_clone_create(struct if_clone *, int);
int	bridge_clone_destroy(struct ifnet *);
+void   bridge_take(void *);
+void   bridge_rele(void *);
 
#define	ETHERADDR_IS_IP_MCAST(a) \
/* struct etheraddr *a; */  \
@@ -152,6 +154,8 @@ struct if_clone bridge_cloner =
 
 const struct ether_brport bridge_brport = {
bridge_input,
+   bridge_take,
+   bridge_rele,
NULL,
 };
 
@@ -2049,3 +2053,15 @@ bridge_send_icmp_err(struct ifnet *ifp,
  dropit:
m_freem(n);
 }
+
+void
+bridge_take(void *unused)
+{
+   return;
+}
+
+void
+bridge_rele(void *unused)
+{
+   return;
+}
diff --git a/sys/net/if_ethersubr.c b/sys/net/if_ethersubr.c
index c9493b7f857..4b975e2d4a1 100644
--- a/sys/net/if_ethersubr.c
+++ b/sys/net/if_ethersubr.c
@@ -426,14 +426,16 @@ ether_input(struct ifnet *ifp, struct mbuf *m)
 
smr_read_enter();
eb = SMR_PTR_GET(&ac->ac_brport);
+   if (eb != NULL)
+   eb->eb_port_take(eb->eb_port);
+   smr_read_leave();
if (eb != NULL) {
m = (*eb->eb_input)(ifp, m, dst, eb->eb_port);
+   eb->eb_port_rele(eb->eb_port);
if (m == NULL) {
-   smr_read_leave();
return;
}
}
-   smr_read_leave();
 
/*
 * Fourth phase: drop service delimited packets.
diff --git a/sys/net/if_switch.c b/sys/net/if_switch.c
index 22aecdc6b75..ca24597d1f7 100644
--- a/sys/net/if_switch.c
+++ b/sys/net/if_switch.c
@@ -110,6 +110,8 @@ struct mbuf
void	switch_flow_classifier_dump(struct switch_softc *,
	    struct switch_flow_classify *);
void	switchattach(int);
+void   switch_take(void *);
+void   switch_rele(void *);
 
 struct