Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-13 Thread Alexandr Nedvedicky
Hello Sebastien,

> > thank you very much for doing the test for me.
> > patch below should kill the bug.
> > 
> > patch applies to current.
> 
> I confirm that since 4 days the host is stable without any issues.
> 
> Your diff is ok semarie@
> 

thank you for OK. unfortunately one more tweak is needed. The change
you were testing allows race, where we leak a reference, when
two packets are trying to bind state keys with each other at
the same time.

I've found the leak on test box provided by Hrvoje, where I'm
able to put the code under the stress.

this got fixed by diff below. I believe patch below should keep
your box happy and reference leak plugged. I would like to
ask you if you can put it in and give it a try.

thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf.c b/sys/net/pf.c
index 23eebf4a274..bcd81a83300 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -7373,14 +7373,21 @@ pf_state_key_link_reverse(struct pf_state_key *sk, 
struct pf_state_key *skrev)
old_reverse = atomic_cas_ptr(>reverse, NULL, skrev);
if (old_reverse != NULL)
KASSERT(old_reverse == skrev);
-   else
+   else {
pf_state_key_ref(skrev);
 
-   old_reverse = atomic_cas_ptr(>reverse, NULL, sk);
-   if (old_reverse != NULL)
-   KASSERT(old_reverse == sk);
-   else
+   /*
+* NOTE: if sk == skrev, then KASSERT() below holds true, we
+* still want to grab a reference in such case, because
+* pf_state_key_unlink_reverse() does not check whether keys
+* are identical or not.
+*/
+   old_reverse = atomic_cas_ptr(>reverse, NULL, sk);
+   if (old_reverse != NULL)
+   KASSERT(old_reverse == sk);
+
pf_state_key_ref(sk);
+   }
 }
 
 #if NPFLOG > 0



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-12 Thread Sebastien Marie
On Sat, May 08, 2021 at 12:31:50AM +0200, Alexandr Nedvedicky wrote:
> Hello,
> 
> thank you very much for doing the test for me.
> patch below should kill the bug.
> 
> patch applies to current.

I confirm that since 4 days the host is stable without any issues.

Your diff is ok semarie@

Thanks !

> 
> 8<---8<---8<--8<
> diff --git a/sys/net/pf.c b/sys/net/pf.c
> index 23eebf4a274..275ef8a6157 100644
> --- a/sys/net/pf.c
> +++ b/sys/net/pf.c
> @@ -7376,11 +7376,15 @@ pf_state_key_link_reverse(struct pf_state_key *sk, 
> struct pf_state_key *skrev)
>   else
>   pf_state_key_ref(skrev);
>  
> + /*
> +  * NOTE: if sk == skrev, then KASSERT() below holds true. We still
> +  * want to grab a reference in such case.
> +  */
>   old_reverse = atomic_cas_ptr(>reverse, NULL, sk);
>   if (old_reverse != NULL)
>   KASSERT(old_reverse == sk);
> - else
> - pf_state_key_ref(sk);
> +
> + pf_state_key_ref(sk);
>  }
>  
>  #if NPFLOG > 0

-- 
Sebastien Marie



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-07 Thread Olivier Cherrier
On Sat, May 08, 2021 at 12:31:50AM +0200, alexandr.nedvedi...@oracle.com wrote:
> Hello,
> 
> thank you very much for doing the test for me.
> patch below should kill the bug.
> 
> patch applies to current.
> 
> thanks and
> regards
> sashan


Hello Alexandr,

I'll give it a try :)

In the meantime, I got a similar crash than Sebastien:


ddb> show panic
kernel diagnostic assertion "sk != skrev" failed: file "/sys/net/pf.c", line 
7372
ddb> trace
db_enter() at db_enter+0x4
panic(d0bb8d61) at panic+0xd3
__assert(d0c1f6ee,d0c06097,1ccc,d0bcb680) at __assert+0x19
pf_state_key_link_reverse(d16fddd0,d16fddd0) at pf_state_key_link_reverse+0x6e
pf_find_state(f397efe4,f397ef6c,f397efd4) at pf_find_state+0x1b9
pf_test_state(f397efe4,f397efd4,f397efe2,0) at pf_test_state+0x10e
pf_test(2,2,d19ea030,f397f0ec) at pf_test+0xcb4
bridge_ip(d1d61000,2,d19ea030,f397f12c,d1d7ea00) at bridge_ip+0x43b
bridge_broadcast(d1d61000,d1d55800,f397f12c,d1d7ea00) at bridge_broadcast+0x191
bridgeintr_frame(d1d61000,d1d55800,d1d7ea00) at bridgeintr_frame+0x2cc
bridge_process(d1d55800,d1d7ea00) at bridge_process+0x2be
bridgeintr() at bridgeintr+0x70
if_netisr(0) at if_netisr+0xc3
taskq_thread(d19e1040) at taskq_thread+0x51
ddb> 


Thanks,
Best.

-- 
Olivier Cherrier
Phone: +352691570680
mailto:o...@symacx.com



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-07 Thread Sebastien Marie
On Fri, May 07, 2021 at 02:16:35AM +0200, Alexandr Nedvedicky wrote:
> 
> I'd like to ask you for yet another brave test. just to verify
> the story I dream of above really happens. The plan is to put
> yet another KASSERT() to pf_state_key_link_reverse():
> 
>   KASSERT(sk != skrev);
> 
> I expect the KASSERT will fire sooner or later. I'm not sure
> which packet could trigger such condition (sk == skrev), I
> suspect this could be kind of multicast/broadcast packet.
> Hence I'd like to ask you to give a try KASSSERT() above.
> 

panic: kernel diagnostic assertion "sk != skrev" failed: file 
"/usr/src/sys/net/pf.c", line 7372
Stopped at  db_enter+0x10:  popq%rbp
TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
* 51803  77372  0 0x14000  0x2001K softnet
db_enter() at db_enter+0x10
panic(81e05c77) at panic+0x12a
__assert(81e706d5,81de2d6c,1ccc,81e1a122) at 
__assert+0x2b
pf_state_key_link_reverse(fd81091c1000,fd81091c1000) at 
pf_state_key_link_reverse+0x8f
pf_find_state(80002246e240,80002246e178,80002246e330) at 
pf_find_state+0x204
pf_test_state(80002246e240,80002246e330,80002246e33e,0) at 
pf_test_state+0x178
pf_test(2,2,800bb050,80002246e3c8) at pf_test+0xcec
bridge_ip(80a53000,2,800bb050,80002246e4c0,fd80be0b6000)
 at bridge_ip+0x609
bridge_broadcast(80a53000,80099048,80002246e4c0,fd80cf52f300)
 at bridge_broadcast+0x1a2
bridgeintr_frame(80a53000,80099048,fd80cf52f300) at 
bridgeintr_frame+0x2fa
bridge_process(80099048,fd80cf52f300) at 
bridge_process+0x2d6bridgeintr() at bridgeintr+0x84
if_netisr(0) at if_netisr+0xe9
taskq_thread(80033000) at taskq_thread+0x9f
end trace frame: 0x0, count: 1
https://www.openbsd.org/ddb.html describes the minimum info required in bug
reports.  Insufficient info makes it difficult to find and fix bugs.
ddb{1}> 

Thanks.
-- 
Sebastien Marie



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-06 Thread Alexandr Nedvedicky
Hello Olivier and Sebastien,

I took a look at old version of pf_state_key_link_reverse(),
before my commit [1] changed it. The answer was there:

7368 void
7369 pf_state_key_link_reverse(struct pf_state_key *sk, struct pf_state_key 
*skrev)
7370 {
7371 /* Note that sk and skrev may be equal, then we refcount twice. */
7372 KASSERT(sk != skrev);
7373 KASSERT(sk->reverse == NULL);
7374 KASSERT(skrev->reverse == NULL);
7375 sk->reverse = pf_state_key_ref(skrev);
7376 skrev->reverse = pf_state_key_ref(sk);
7377 }

comment at line 7371 says the skrev and sk may be equal.

Taking a look at current pf_state_key_link_reverse(), which is in the tree
right now, behaves differently to old function when sk and skrev are
identical:

7368 void
7369 pf_state_key_link_reverse(struct pf_state_key *sk, struct pf_state_key 
*skrev)
7370 {
7371 struct pf_state_key *old_reverse;
7372 
7373 old_reverse = atomic_cas_ptr(>reverse, NULL, skrev);
7374 if (old_reverse != NULL)
7375 KASSERT(old_reverse == skrev);
7376 else
7377 pf_state_key_ref(skrev);
7378 
7379 old_reverse = atomic_cas_ptr(>reverse, NULL, sk);
7380 if (old_reverse != NULL)
7381 KASSERT(old_reverse == sk);
7382 else
7383 pf_state_key_ref(sk);
7384 }

note that if both keys are identical, we are going to grab just single
reference. The story goes like that:

we assume sk == skrev

sk->reverse == NULL when we do  atomic_cas() at line 7373
the cas sets sk->reverse to skrev, which is equal to sk

the old_reverse is NULL when we reach 7374, therefore we
grab reference to skrev

at line 7379 we do cas_ptr on skrev->reverse.
because sk == skrev, the skrev->reverse is not NULL.

the old_reverse is not NULL when we reach 7380,
old_reverse equals to skrev (and sk)

old_reverse == skrev == sk

in this case we do KASSERT(old_reverse == sk) at line 7381,
the assertion holds so we just return from the function
without getting extra reference. this is wrong.

I'd like to ask you for yet another brave test. just to verify
the story I dream of above really happens. The plan is to put
yet another KASSERT() to pf_state_key_link_reverse():

KASSERT(sk != skrev);

I expect the KASSERT will fire sooner or later. I'm not sure
which packet could trigger such condition (sk == skrev), I
suspect this could be kind of multicast/broadcast packet.
Hence I'd like to ask you to give a try KASSSERT() above.

diff below is against current tree. It adds desired assert to
your pf_state_key_link_reverse().

thank you very much for your help.

regards
sashan

[1] https://marc.info/?l=openbsd-cvs=161951631726837=2

8<---8<---8<--8<
diff --git a/sys/net/pf.c b/sys/net/pf.c
index 23eebf4a274..298568dec28 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -7368,19 +7368,12 @@ pf_inp_unlink(struct inpcb *inp)
 void
 pf_state_key_link_reverse(struct pf_state_key *sk, struct pf_state_key *skrev)
 {
-   struct pf_state_key *old_reverse;
-
-   old_reverse = atomic_cas_ptr(>reverse, NULL, skrev);
-   if (old_reverse != NULL)
-   KASSERT(old_reverse == skrev);
-   else
-   pf_state_key_ref(skrev);
-
-   old_reverse = atomic_cas_ptr(>reverse, NULL, sk);
-   if (old_reverse != NULL)
-   KASSERT(old_reverse == sk);
-   else
-   pf_state_key_ref(sk);
+   /* Note that sk and skrev may be equal, then we refcount twice. */
+   KASSERT(sk != skrev);
+   KASSERT(sk->reverse == NULL);
+   KASSERT(skrev->reverse == NULL);
+   sk->reverse = pf_state_key_ref(skrev);
+   skrev->reverse = pf_state_key_ref(sk);
 }
 
 #if NPFLOG > 0



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-06 Thread Olivier Cherrier
On Thu, May 06, 2021 at 04:50:31PM +0200, alexandr.nedvedi...@oracle.com wrote:
> thank you for your help with this. I have not heard back from Sebastien 
> yet.
> one more question:
>   are you building your bsd kernel with DIAGNOSTIC option enabled?

It is the GENERIC kernel. So with DIAGNOSTIC:
$ grep DIAG /sys/conf/GENERIC  
option  DIAGNOSTIC  # internal consistency checks
$


Thanks,
Best.

-- 
Olivier Cherrier
Phone: +352691570680
mailto:o...@symacx.com



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-06 Thread Sebastien Marie
On Thu, May 06, 2021 at 06:10:39PM +0200, Alexandr Nedvedicky wrote:
> Hello,
> 
> 
> > > to be honest I have no idea what could be causing problems on those two 
> > > fairly
> > > distinct machines. The strange thing is that pf_test() currently does not 
> > > run in
> > > parallel. I don't quite understand why reverting my earlier change helps 
> > > here.
> > 
> > it could be two differents ways to trigger a bug somewhere else that
> > your commit expose.
> > 
> > the panic doesn't trigger in the same way on both machines:
> > - Olivier's machine seems to trigger it quickly (after some minutes)
> > - mine relatively slowly (~ once a day)
> 
> Olivier's machine acts as AP, so it forwards packets between interfaces.
> 
> If I remember correctly your machine is laptop/workstation, which
> does not forward traffic. 

it doesn't forward, but it acts as a bridge: 2 physical networks cards
grouped in a bridge(4) with only few traffic (a network printer is on
other side, the bridge(4) is here because I had a sparse network card
and no physical-switch to put here).

> the function, which we change back and forth here is
> pf_state_key_link_reverse(), which is being called from pf_find_state() 
> here:
> 
> 1085 
> 1086 if (sk == NULL) {
> 1087 if ((sk = RB_FIND(pf_state_tree, _statetbl,
> 1088 (struct pf_state_key *)key)) == NULL)
> 1089 return (PF_DROP);
> 1090 if (pd->dir == PF_OUT && pkt_sk &&
> 1091 pf_compare_state_keys(pkt_sk, sk, pd->kif, pd->dir) 
> == 0)
> 1092 pf_state_key_link_reverse(sk, pkt_sk);
> 1093 else if (pd->dir == PF_OUT && pd->m->m_pkthdr.pf.inp &&
> 1094 !pd->m->m_pkthdr.pf.inp->inp_pf_sk && !sk->inp)
> 1095 pf_state_key_link_inpcb(sk, 
> pd->m->m_pkthdr.pf.inp);
> 1096 }
> 1097 
> 
> the story in human words goes as follows:
> 
>   sk == NULL -> no matching state key was attached to packet, Thus we
>   have to search state key in state tree using RB_FIND()
> 
>   if we could find state key for packet in table, then we will try
>   to set up a 'shortcut', which can save us RB_FIND() later.
> 
>   1090 - 1092
>   the shortcut can be set up for outbound packet only (pd->dir PF_OUT),
>   which is also being forwarded (pkt_sk != NULL, indicates we are seeing
>   the packet for the second time pkt_sk holds state key for inbound
>   direction).  pf_compare_state_keys() is sanity check, it leaves a
>   debug message on system console on failure.
> 
>   So if it is outbound forwarded packet, we've seen earlier, we
>   set up a reverse link to save one RB_FIND() operation on next
>   forwarded packet, which matches the same state.
> 
>   1093 - 1095
>   creates similar shortcut for local bound packets. We put reference 
>   to state key into PCB linked to socket. This will save us RB_FIND()
>   operation for next local outbound, which matches the same state.
> 
> 
> given the bug seems to be triggered/uncovered by pf_state_key_link_reverse()
> is there any chance your laptop/workstation occasionally forwards packets?
> like doing NAT for vmd/qemu virtual machine?

I don't know bridge(4) internals, but it could make sense that it is
using such functions.

> if it is not the case then the question is how does it come we run
> pf_state_key_link_reverse()? which same as why pkt_sk is not NULL at line 
> 1090.
> 
> > 
> > I could try to run with your commit and see if I could trigger it more
> > easily or found some elements influencing it. I could try with GENERIC
> > for example to see if I still trigger the same assert() or if it is
> > more like Olivier.
> 
> I need to think of how to further debug the thing.
> 
> > 
> > my LAN was several hosts with the same kernel and only this machine
> > trigger the panic, so it shouldn't be strictly linked to the
> > environment.
> > 

Thanks
-- 
Sebastien Marie



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-06 Thread Alexandr Nedvedicky
Hello,


> > to be honest I have no idea what could be causing problems on those two 
> > fairly
> > distinct machines. The strange thing is that pf_test() currently does not 
> > run in
> > parallel. I don't quite understand why reverting my earlier change helps 
> > here.
> 
> it could be two differents ways to trigger a bug somewhere else that
> your commit expose.
> 
> the panic doesn't trigger in the same way on both machines:
> - Olivier's machine seems to trigger it quickly (after some minutes)
> - mine relatively slowly (~ once a day)

Olivier's machine acts as AP, so it forwards packets between interfaces.

If I remember correctly your machine is laptop/workstation, which
does not forward traffic. 

the function, which we change back and forth here is
pf_state_key_link_reverse(), which is being called from pf_find_state() 
here:

1085 
1086 if (sk == NULL) {
1087 if ((sk = RB_FIND(pf_state_tree, _statetbl,
1088 (struct pf_state_key *)key)) == NULL)
1089 return (PF_DROP);
1090 if (pd->dir == PF_OUT && pkt_sk &&
1091 pf_compare_state_keys(pkt_sk, sk, pd->kif, pd->dir) == 
0)
1092 pf_state_key_link_reverse(sk, pkt_sk);
1093 else if (pd->dir == PF_OUT && pd->m->m_pkthdr.pf.inp &&
1094 !pd->m->m_pkthdr.pf.inp->inp_pf_sk && !sk->inp)
1095 pf_state_key_link_inpcb(sk, 
pd->m->m_pkthdr.pf.inp);
1096 }
1097 

the story in human words goes as follows:

sk == NULL -> no matching state key was attached to packet, Thus we
have to search state key in state tree using RB_FIND()

if we could find state key for packet in table, then we will try
to set up a 'shortcut', which can save us RB_FIND() later.

1090 - 1092
the shortcut can be set up for outbound packet only (pd->dir PF_OUT),
which is also being forwarded (pkt_sk != NULL, indicates we are seeing
the packet for the second time pkt_sk holds state key for inbound
direction).  pf_compare_state_keys() is sanity check, it leaves a
debug message on system console on failure.

So if it is outbound forwarded packet, we've seen earlier, we
set up a reverse link to save one RB_FIND() operation on next
forwarded packet, which matches the same state.

1093 - 1095
creates similar shortcut for local bound packets. We put reference 
to state key into PCB linked to socket. This will save us RB_FIND()
operation for next local outbound, which matches the same state.


given the bug seems to be triggered/uncovered by pf_state_key_link_reverse()
is there any chance your laptop/workstation occasionally forwards packets?
like doing NAT for vmd/qemu virtual machine?

if it is not the case then the question is how does it come we run
pf_state_key_link_reverse()? which same as why pkt_sk is not NULL at line 1090.


> 
> I could try to run with your commit and see if I could trigger it more
> easily or found some elements influencing it. I could try with GENERIC
> for example to see if I still trigger the same assert() or if it is
> more like Olivier.

I need to think of how to further debug the thing.

> 
> my LAN was several hosts with the same kernel and only this machine
> trigger the panic, so it shouldn't be strictly linked to the
> environment.
> 

thanks a lot for your help (and patience)

regards
sashan



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-06 Thread Sebastien Marie
On Thu, May 06, 2021 at 04:50:31PM +0200, Alexandr Nedvedicky wrote:
> Hello Olivier,
> > 
> > This morning, I've rebuild a new --current kernel and got some panics after
> > some minutes with PF enabled.
> > Then I've applied your patch and it is stable so far.
> > 
> 
> thank you for your help with this. I have not heard back from Sebastien 
> yet.

my machine is stable with the commit reverted: up 1 day, 8:20.

> one more question:
>   are you building your bsd kernel with DIAGNOSTIC option enabled?
>   I think you don't, because your crash matches uvm fault due to
>   use-after-free.
> 
> Sebastien hit the problem earlier by KASSERT().
> 
> just to summarize there are two boxes so far, which choked up with my commit 
> [1].
> both boxes are quite different. yours runs bsd kernel on single core CPU:
> 
> cpu0: Geode(TM) Integrated Processor by AMD PCS ("AuthenticAMD" 586-class) 
> 499 MHz, \
>   05-0a-02
> 
> Sebastien runs bsd.mp on two CPU cores:
> cpu0: Intel(R) Core(TM)2 Duo CPU E6750 @ 2.66GHz, 2660.30 MHz, 06-0f-0b
> 
> I'm not able to trigger crash on my HW. Which is notebbok running bsd.mp on:
> 
> cpu0: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz, 1496.74 MHz, 06-45-01
> 
> the other box, is APU router running bsd.mp on 4 cores:
> cpu0: AMD GX-412TC SOC, 998.26 MHz, 16-30-01
> 
> 
> to be honest I have no idea what could be causing problems on those two fairly
> distinct machines. The strange thing is that pf_test() currently does not run 
> in
> parallel. I don't quite understand why reverting my earlier change helps here.

it could be two differents ways to trigger a bug somewhere else that
your commit expose.

the panic doesn't trigger in the same way on both machines:
- Olivier's machine seems to trigger it quickly (after some minutes)
- mine relatively slowly (~ once a day)

I could try to run with your commit and see if I could trigger it more
easily or found some elements influencing it. I could try with GENERIC
for example to see if I still trigger the same assert() or if it is
more like Olivier.

my LAN was several hosts with the same kernel and only this machine
trigger the panic, so it shouldn't be strictly linked to the
environment.

Thanks.
-- 
Sebastien Marie



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-06 Thread Alexandr Nedvedicky
Hello Olivier,



> > in your case we've missed the assert and are dying on uvm fault.
> > 
> > you both seem to be using rdr-to. your pf seems to use also divert-to rule.
> > I suspect something is going wrong when we deal with traffic, which matches
> > rdr-to rule.
> > 
> > 
> > would you be so kind and try diff below on your AP box. The diff removes
> > my change to pf_state_key_link_reverse(). Which is a primary suspect
> > at the moment.
> > 
> > I'm not able to trigger the panic on my notebook, nor on my
> > home router.
> 
> This morning, I've rebuild a new --current kernel and got some panics after
> some minutes with PF enabled.
> Then I've applied your patch and it is stable so far.
> 

thank you for your help with this. I have not heard back from Sebastien yet.
one more question:
are you building your bsd kernel with DIAGNOSTIC option enabled?
I think you don't, because your crash matches uvm fault due to
use-after-free.

Sebastien hit the problem earlier by KASSERT().

just to summarize there are two boxes so far, which choked up with my commit 
[1].
both boxes are quite different. yours runs bsd kernel on single core CPU:

cpu0: Geode(TM) Integrated Processor by AMD PCS ("AuthenticAMD" 586-class) 499 
MHz, \
05-0a-02

Sebastien runs bsd.mp on two CPU cores:
cpu0: Intel(R) Core(TM)2 Duo CPU E6750 @ 2.66GHz, 2660.30 MHz, 06-0f-0b 


   


I'm not able to trigger crash on my HW. Which is notebbok running bsd.mp on:

cpu0: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz, 1496.74 MHz, 06-45-01

the other box, is APU router running bsd.mp on 4 cores:
cpu0: AMD GX-412TC SOC, 998.26 MHz, 16-30-01


to be honest I have no idea what could be causing problems on those two fairly
distinct machines. The strange thing is that pf_test() currently does not run in
parallel. I don't quite understand why reverting my earlier change helps here.

sorry for inconveniences
regards
sashan

[1] https://marc.info/?l=openbsd-cvs=161951631726837=2



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-06 Thread Olivier Cherrier


Hello Alexandr,

On Wed, May 05, 2021 at 05:53:25PM +0200, alexandr.nedvedi...@oracle.com wrote:
> I've seen your report here
> 
> https://marc.info/?l=openbsd-bugs=161968896108810
> 
> your crash is slightly different. Sebastien is lucky enough
> to trip crash in assert, when state key is dereferenced.
 
OK.

> in your case we've missed the assert and are dying on uvm fault.
> 
> you both seem to be using rdr-to. your pf seems to use also divert-to rule.
> I suspect something is going wrong when we deal with traffic, which matches
> rdr-to rule.
> 
> 
> would you be so kind and try diff below on your AP box. The diff removes
> my change to pf_state_key_link_reverse(). Which is a primary suspect
> at the moment.
> 
> I'm not able to trigger the panic on my notebook, nor on my
> home router.

This morning, I've rebuild a new --current kernel and got some panics after
some minutes with PF enabled.
Then I've applied your patch and it is stable so far.

Thanks,
Best.

-- 
Olivier Cherrier
Phone: +352691570680
mailto:o...@symacx.com



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-05 Thread Alexandr Nedvedicky
Hello Olivier,

I've seen your report here

https://marc.info/?l=openbsd-bugs=161968896108810

your crash is slightly different. Sebastien is lucky enough
to trip crash in assert, when state key is dereferenced.

in your case we've missed the assert and are dying on uvm fault.

you both seem to be using rdr-to. your pf seems to use also divert-to rule.
I suspect something is going wrong when we deal with traffic, which matches
rdr-to rule.


would you be so kind and try diff below on your AP box. The diff removes
my change to pf_state_key_link_reverse(). Which is a primary suspect
at the moment.

I'm not able to trigger the panic on my notebook, nor on my
home router.

thank you for your help
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf.c b/sys/net/pf.c
index 23eebf4a274..12d05976f0b 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -7368,19 +7368,11 @@ pf_inp_unlink(struct inpcb *inp)
 void
 pf_state_key_link_reverse(struct pf_state_key *sk, struct pf_state_key *skrev)
 {
-   struct pf_state_key *old_reverse;
-
-   old_reverse = atomic_cas_ptr(>reverse, NULL, skrev);
-   if (old_reverse != NULL)
-   KASSERT(old_reverse == skrev);
-   else
-   pf_state_key_ref(skrev);
-
-   old_reverse = atomic_cas_ptr(>reverse, NULL, sk);
-   if (old_reverse != NULL)
-   KASSERT(old_reverse == sk);
-   else
-   pf_state_key_ref(sk);
+   /* Note that sk and skrev may be equal, then we refcount twice. */
+   KASSERT(sk->reverse == NULL);
+   KASSERT(skrev->reverse == NULL);
+   sk->reverse = pf_state_key_ref(skrev);
+   skrev->reverse = pf_state_key_ref(sk);
 }
 
 #if NPFLOG > 0



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-05 Thread Olivier Cherrier
On Tue, May 04, 2021 at 12:26:17PM +0200, sema...@online.fr wrote:
> Date: Tue, 4 May 2021 12:26:17 +0200
> From: Sebastien Marie 
> To: Alexandr Nedvedicky 
> Cc: bugs@openbsd.org
> Subject: Re: [External] : pf_state_key_unref: panic: kernel diagnostic
>  assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c",
>  line 826
> 
> On Tue, May 04, 2021 at 11:47:55AM +0200, Alexandr Nedvedicky wrote:
> > Hello Sebastien,
> > 
> > On Tue, May 04, 2021 at 11:08:19AM +0200, Sebastien Marie wrote:
> > > Hi,
> > > 
> > > Currently, I am regulary (~1 per day) get panic on an amd64 host (OpenBSD 
> > > 6.9-current (GENERIC.MP) #492: Sat May  1 17:37:28 MDT 2021).
> 
> Previous working kernel was OpenBSD 6.9-current (GENERIC.MP) #477: Sat Apr 24 
> 16:08:13 MDT 2021
 

It looks to be similar than this one:
https://marc.info/?l=openbsd-bugs=161968896108810


-- 
Olivier Cherrier
Phone: +352691570680
mailto:o...@symacx.com



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-05 Thread Sebastien Marie
On Tue, May 04, 2021 at 04:50:06PM +0200, Sebastien Marie wrote:
> On Tue, May 04, 2021 at 02:15:05PM +0200, Alexandr Nedvedicky wrote:
> > Hello Sebastien,
> > 
> > thank you for additional info about previously working kernel.
> > 
> > it looks like your older kernel, which works, might be running without
> > my commit
> > 
> > revision 1.1116
> > date: 2021/04/27 09:38:29;  author: sashan;  state: Exp;\
> >  lines: +14 -6;  commitid: 3W1fRTkLb3ZlUanF;
> > pf_state_key_link_reverse() is prone to race on parallel forwarding
> > 
> > we need to adjust assertions. at time we call 
> > pf_state_key_link_reverse()
> > is state_key either linked to correct reverse peer or not linked at all.
> > The pf_state_key_link_reverse() is being called as a reader ons 
> > tate_lock.
> > There might be more packets, which try to update the state key.
> > 
> > OK bluhm@
> > 
> > > # cat /etc/pf.conf
> > > # See pf.conf(5) and /etc/examples/pf.conf
> > > # 
> > > # avoid skip on lo to get tryton redir
> > > #set skip on lo
> > > 
> > > block return# block stateless traffic
> > > pass# establish keep-state
> > > 
> > > # redir 80 -> 8000
> > > pass in proto tcp to (self) port 80 rdr-to 
> > > 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000
> > > 
> > 
> > the rule above is also interesting bit. this may trigger NAT-64.
> > I'm not using IPv6 on my hosts/routers. I'll try to use similar rule
> > to try to reproduce the crash.
> 
> I will first try with another rule:
> 
>   pass in inet6 proto tcp to (self) port 80 rdr-to 
> 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000
> 
> to avoid the possible NAT-64.

the assert still trigger.

> And next (if I still trigger the assert) I will rebuild a kernel and
> backout this specific commit.

trying with a kernel without this commit

-- 
Sebastien Marie



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-04 Thread Alexandr Nedvedicky
On Tue, May 04, 2021 at 05:01:35PM +0200, Florian Obser wrote:
> On 2021-05-04 16:50 +02, Sebastien Marie  wrote:
> > On Tue, May 04, 2021 at 02:15:05PM +0200, Alexandr Nedvedicky wrote:
> >> Hello Sebastien,
> >> 
> >> thank you for additional info about previously working kernel.
> >> 
> >> it looks like your older kernel, which works, might be running without
> >> my commit
> >> 
> >> revision 1.1116
> >> date: 2021/04/27 09:38:29;  author: sashan;  state: Exp;\
> >> lines: +14 -6;  commitid: 3W1fRTkLb3ZlUanF;
> >> pf_state_key_link_reverse() is prone to race on parallel forwarding
> >> 
> >> we need to adjust assertions. at time we call 
> >> pf_state_key_link_reverse()
> >> is state_key either linked to correct reverse peer or not linked at 
> >> all.
> >> The pf_state_key_link_reverse() is being called as a reader ons 
> >> tate_lock.
> >> There might be more packets, which try to update the state key.
> >> 
> >> OK bluhm@
> >> 
> >> > # cat /etc/pf.conf
> >> > # See pf.conf(5) and /etc/examples/pf.conf
> >> > # 
> >> > # avoid skip on lo to get tryton redir
> >> > #set skip on lo
> >> > 
> >> > block return# block stateless traffic
> >> > pass# establish keep-state
> >> > 
> >> > # redir 80 -> 8000
> >> > pass in proto tcp to (self) port 80 rdr-to
> >> > 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000
> >> > 
> >> 
> >> the rule above is also interesting bit. this may trigger NAT-64.
> >> I'm not using IPv6 on my hosts/routers. I'll try to use similar rule
> >> to try to reproduce the crash.
> >
> > I will first try with another rule:
> >
> >   pass in inet6 proto tcp to (self) port 80 rdr-to
> > 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000
> >
> > to avoid the possible NAT-64.
> 
> fwiw I intentionally run NAT64 and I don't see this.
> But I only have one client, my android phone.
> 
> kern.version=OpenBSD 6.9-current (GENERIC.MP) #559: Fri Apr 30 03:31:48 MDT 
> 2021
> 
> 
> set skip on lo
> block return  # block stateless traffic
> pass  # establish keep-state
> pass in log on vlan64 inet6 from any to 64:ff9b::/96 af-to inet \
>   from 192.168.178.23
> 
> 

indeed, NAT and NAT-64 are essentially same from state table point of view.
it looks like the issue is rather triggered by local-bound traffic, which
matches rdr-to rule. We might be missing a reference when linking a state
key to pcb bound to local socket. This is where I'm currently looking at.


thanks and
regards
sashan



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-04 Thread Florian Obser
On 2021-05-04 16:50 +02, Sebastien Marie  wrote:
> On Tue, May 04, 2021 at 02:15:05PM +0200, Alexandr Nedvedicky wrote:
>> Hello Sebastien,
>> 
>> thank you for additional info about previously working kernel.
>> 
>> it looks like your older kernel, which works, might be running without
>> my commit
>> 
>> revision 1.1116
>> date: 2021/04/27 09:38:29;  author: sashan;  state: Exp;\
>>   lines: +14 -6;  commitid: 3W1fRTkLb3ZlUanF;
>> pf_state_key_link_reverse() is prone to race on parallel forwarding
>> 
>> we need to adjust assertions. at time we call pf_state_key_link_reverse()
>> is state_key either linked to correct reverse peer or not linked at all.
>> The pf_state_key_link_reverse() is being called as a reader ons 
>> tate_lock.
>> There might be more packets, which try to update the state key.
>> 
>> OK bluhm@
>> 
>> > # cat /etc/pf.conf
>> > # See pf.conf(5) and /etc/examples/pf.conf
>> > # 
>> > # avoid skip on lo to get tryton redir
>> > #set skip on lo
>> > 
>> > block return# block stateless traffic
>> > pass# establish keep-state
>> > 
>> > # redir 80 -> 8000
>> > pass in proto tcp to (self) port 80 rdr-to
>> > 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000
>> > 
>> 
>> the rule above is also interesting bit. this may trigger NAT-64.
>> I'm not using IPv6 on my hosts/routers. I'll try to use similar rule
>> to try to reproduce the crash.
>
> I will first try with another rule:
>
>   pass in inet6 proto tcp to (self) port 80 rdr-to
> 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000
>
> to avoid the possible NAT-64.

fwiw I intentionally run NAT64 and I don't see this.
But I only have one client, my android phone.

kern.version=OpenBSD 6.9-current (GENERIC.MP) #559: Fri Apr 30 03:31:48 MDT 2021


set skip on lo
block return# block stateless traffic
pass# establish keep-state
pass in log on vlan64 inet6 from any to 64:ff9b::/96 af-to inet \
from 192.168.178.23


>
> And next (if I still trigger the assert) I will rebuild a kernel and
> backout this specific commit.
>
> It should permit to see if the commit is the problem or not.
>
> Thanks.

-- 
I'm not entirely sure you are real.



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-04 Thread Sebastien Marie
On Tue, May 04, 2021 at 02:15:05PM +0200, Alexandr Nedvedicky wrote:
> Hello Sebastien,
> 
> thank you for additional info about previously working kernel.
> 
> it looks like your older kernel, which works, might be running without
> my commit
> 
> revision 1.1116
> date: 2021/04/27 09:38:29;  author: sashan;  state: Exp;\
>lines: +14 -6;  commitid: 3W1fRTkLb3ZlUanF;
> pf_state_key_link_reverse() is prone to race on parallel forwarding
> 
> we need to adjust assertions. at time we call pf_state_key_link_reverse()
> is state_key either linked to correct reverse peer or not linked at all.
> The pf_state_key_link_reverse() is being called as a reader ons tate_lock.
> There might be more packets, which try to update the state key.
> 
> OK bluhm@
> 
> > # cat /etc/pf.conf
> > # See pf.conf(5) and /etc/examples/pf.conf
> > # 
> > # avoid skip on lo to get tryton redir
> > #set skip on lo
> > 
> > block return# block stateless traffic
> > pass# establish keep-state
> > 
> > # redir 80 -> 8000
> > pass in proto tcp to (self) port 80 rdr-to 
> > 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000
> > 
> 
> the rule above is also interesting bit. this may trigger NAT-64.
> I'm not using IPv6 on my hosts/routers. I'll try to use similar rule
> to try to reproduce the crash.

I will first try with another rule:

  pass in inet6 proto tcp to (self) port 80 rdr-to 
2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000

to avoid the possible NAT-64.

And next (if I still trigger the assert) I will rebuild a kernel and
backout this specific commit.

It should permit to see if the commit is the problem or not.

Thanks.
-- 
Sebastien Marie



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-04 Thread Alexandr Nedvedicky
Hello Sebastien,

thank you for additional info about previously working kernel.

it looks like your older kernel, which works, might be running without
my commit

revision 1.1116
date: 2021/04/27 09:38:29;  author: sashan;  state: Exp;\
 lines: +14 -6;  commitid: 3W1fRTkLb3ZlUanF;
pf_state_key_link_reverse() is prone to race on parallel forwarding

we need to adjust assertions. at time we call pf_state_key_link_reverse()
is state_key either linked to correct reverse peer or not linked at all.
The pf_state_key_link_reverse() is being called as a reader ons tate_lock.
There might be more packets, which try to update the state key.

OK bluhm@

> # cat /etc/pf.conf
> # See pf.conf(5) and /etc/examples/pf.conf
> # 
> # avoid skip on lo to get tryton redir
> #set skip on lo
> 
> block return# block stateless traffic
> pass# establish keep-state
> 
> # redir 80 -> 8000
> pass in proto tcp to (self) port 80 rdr-to 
> 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000
> 

the rule above is also interesting bit. this may trigger NAT-64.
I'm not using IPv6 on my hosts/routers. I'll try to use similar rule
to try to reproduce the crash.


thanks and
regards
sashan



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-04 Thread Sebastien Marie
On Tue, May 04, 2021 at 11:47:55AM +0200, Alexandr Nedvedicky wrote:
> Hello Sebastien,
> 
> On Tue, May 04, 2021 at 11:08:19AM +0200, Sebastien Marie wrote:
> > Hi,
> > 
> > Currently, I am regulary (~1 per day) get panic on an amd64 host (OpenBSD 
> > 6.9-current (GENERIC.MP) #492: Sat May  1 17:37:28 MDT 2021).

Previous working kernel was OpenBSD 6.9-current (GENERIC.MP) #477: Sat Apr 24 
16:08:13 MDT 2021

> > panic: kernel diagnostic assertion "refcnt != ~0" failed: file 
> > "/usr/src/sys/kern/kern_synch.c", line 826
> > Starting stack trace...
> > panic(81dfbc8e) at panic+0x11d
> > __assert(81e64b54,81e0a6ee,33a,81e03b7f) at 
> > __assert+0x2b
> > refcnt_rele(fd810bf02458) at refcnt_rele+0x6f
> > pf_state_key_unref(fd810bf023f0) at pf_state_key_unref+0x21
> > pf_remove_state(fd810c0c4578) at pf_remove_state+0x1fa
> > pf_purge_expired_states(2) at pf_purge_expired_states+0x232
> > pf_purge(82236a30) at pf_purge+0x33
> > taskq_thread(80032080) at taskq_thread+0x81
> > end trace frame: 0x0, count: 249
> > End of stack trace.
> > 
> > The machine is running with ddb.panic=0, and hangs before rebooting (I
> > assume it hangs during dump). I just changed the config to keep in
> > ddb, and I should be able to get additionnal informations next time.
> > 
> > 
> > As it is panicing in pf related code, some informations on the
> > network: the machine has two networks cards (sis0 and em0) bridged,
> > and a vlan (in rdomain).
> > 
> > em0 is egress, and sis0 has only a network printer behind it.
> > 
> > It panics whereas there is no particular interaction with the host,
> > but time of panic is on state removal so I assume after a timeout.
> > 
> 
> 
> The assert is being tripped out of sudden? or are you also modifying state
> table (pfctl -Fs). or something else is happening forced reboot/power 
> off...

nothing that I know. The assert seems be raised suddenly. But
according to the stacktrace, it occured due to call to
pf_purge_expired_states(), it is why I mentioned timeout.

thanks.
-- 
Sebastien Marie



Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826

2021-05-04 Thread Alexandr Nedvedicky
Hello Sebastien,

On Tue, May 04, 2021 at 11:08:19AM +0200, Sebastien Marie wrote:
> Hi,
> 
> Currently, I am regulary (~1 per day) get panic on an amd64 host (OpenBSD 
> 6.9-current (GENERIC.MP) #492: Sat May  1 17:37:28 MDT 2021).
> 
> panic: kernel diagnostic assertion "refcnt != ~0" failed: file 
> "/usr/src/sys/kern/kern_synch.c", line 826
> Starting stack trace...
> panic(81dfbc8e) at panic+0x11d
> __assert(81e64b54,81e0a6ee,33a,81e03b7f) at 
> __assert+0x2b
> refcnt_rele(fd810bf02458) at refcnt_rele+0x6f
> pf_state_key_unref(fd810bf023f0) at pf_state_key_unref+0x21
> pf_remove_state(fd810c0c4578) at pf_remove_state+0x1fa
> pf_purge_expired_states(2) at pf_purge_expired_states+0x232
> pf_purge(82236a30) at pf_purge+0x33
> taskq_thread(80032080) at taskq_thread+0x81
> end trace frame: 0x0, count: 249
> End of stack trace.
> 
> The machine is running with ddb.panic=0, and hangs before rebooting (I
> assume it hangs during dump). I just changed the config to keep in
> ddb, and I should be able to get additionnal informations next time.
> 
> 
> As it is panicing in pf related code, some informations on the
> network: the machine has two networks cards (sis0 and em0) bridged,
> and a vlan (in rdomain).
> 
> em0 is egress, and sis0 has only a network printer behind it.
> 
> It panics whereas there is no particular interaction with the host,
> but time of panic is on state removal so I assume after a timeout.
> 


The assert is being tripped out of sudden? or are you also modifying state
table (pfctl -Fs). or something else is happening forced reboot/power off...

thanks and
regards
sashan