Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
Hello Sebastien, > > thank you very much for doing the test for me. > > patch below should kill the bug. > > > > patch applies to current. > > I confirm that since 4 days the host is stable without any issues. > > Your diff is ok semarie@ > thank you for OK. unfortunately one more tweak is needed. The change you were testing allows race, where we leak a reference, when two packets are trying to bind state keys with each other at the same time. I've found the leak on test box provided by Hrvoje, where I'm able to put the code under the stress. this got fixed by diff below. I believe patch below should keep your box happy and reference leak plugged. I would like to ask you if you can put it in and give it a try. thanks and regards sashan 8<---8<---8<--8< diff --git a/sys/net/pf.c b/sys/net/pf.c index 23eebf4a274..bcd81a83300 100644 --- a/sys/net/pf.c +++ b/sys/net/pf.c @@ -7373,14 +7373,21 @@ pf_state_key_link_reverse(struct pf_state_key *sk, struct pf_state_key *skrev) old_reverse = atomic_cas_ptr(>reverse, NULL, skrev); if (old_reverse != NULL) KASSERT(old_reverse == skrev); - else + else { pf_state_key_ref(skrev); - old_reverse = atomic_cas_ptr(>reverse, NULL, sk); - if (old_reverse != NULL) - KASSERT(old_reverse == sk); - else + /* +* NOTE: if sk == skrev, then KASSERT() below holds true, we +* still want to grab a reference in such case, because +* pf_state_key_unlink_reverse() does not check whether keys +* are identical or not. +*/ + old_reverse = atomic_cas_ptr(>reverse, NULL, sk); + if (old_reverse != NULL) + KASSERT(old_reverse == sk); + pf_state_key_ref(sk); + } } #if NPFLOG > 0
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
On Sat, May 08, 2021 at 12:31:50AM +0200, Alexandr Nedvedicky wrote: > Hello, > > thank you very much for doing the test for me. > patch below should kill the bug. > > patch applies to current. I confirm that since 4 days the host is stable without any issues. Your diff is ok semarie@ Thanks ! > > 8<---8<---8<--8< > diff --git a/sys/net/pf.c b/sys/net/pf.c > index 23eebf4a274..275ef8a6157 100644 > --- a/sys/net/pf.c > +++ b/sys/net/pf.c > @@ -7376,11 +7376,15 @@ pf_state_key_link_reverse(struct pf_state_key *sk, > struct pf_state_key *skrev) > else > pf_state_key_ref(skrev); > > + /* > + * NOTE: if sk == skrev, then KASSERT() below holds true. We still > + * want to grab a reference in such case. > + */ > old_reverse = atomic_cas_ptr(>reverse, NULL, sk); > if (old_reverse != NULL) > KASSERT(old_reverse == sk); > - else > - pf_state_key_ref(sk); > + > + pf_state_key_ref(sk); > } > > #if NPFLOG > 0 -- Sebastien Marie
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
On Sat, May 08, 2021 at 12:31:50AM +0200, alexandr.nedvedi...@oracle.com wrote: > Hello, > > thank you very much for doing the test for me. > patch below should kill the bug. > > patch applies to current. > > thanks and > regards > sashan Hello Alexandr, I'll give it a try :) In the meantime, I got a similar crash than Sebastien: ddb> show panic kernel diagnostic assertion "sk != skrev" failed: file "/sys/net/pf.c", line 7372 ddb> trace db_enter() at db_enter+0x4 panic(d0bb8d61) at panic+0xd3 __assert(d0c1f6ee,d0c06097,1ccc,d0bcb680) at __assert+0x19 pf_state_key_link_reverse(d16fddd0,d16fddd0) at pf_state_key_link_reverse+0x6e pf_find_state(f397efe4,f397ef6c,f397efd4) at pf_find_state+0x1b9 pf_test_state(f397efe4,f397efd4,f397efe2,0) at pf_test_state+0x10e pf_test(2,2,d19ea030,f397f0ec) at pf_test+0xcb4 bridge_ip(d1d61000,2,d19ea030,f397f12c,d1d7ea00) at bridge_ip+0x43b bridge_broadcast(d1d61000,d1d55800,f397f12c,d1d7ea00) at bridge_broadcast+0x191 bridgeintr_frame(d1d61000,d1d55800,d1d7ea00) at bridgeintr_frame+0x2cc bridge_process(d1d55800,d1d7ea00) at bridge_process+0x2be bridgeintr() at bridgeintr+0x70 if_netisr(0) at if_netisr+0xc3 taskq_thread(d19e1040) at taskq_thread+0x51 ddb> Thanks, Best. -- Olivier Cherrier Phone: +352691570680 mailto:o...@symacx.com
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
On Fri, May 07, 2021 at 02:16:35AM +0200, Alexandr Nedvedicky wrote: > > I'd like to ask you for yet another brave test. just to verify > the story I dream of above really happens. The plan is to put > yet another KASSERT() to pf_state_key_link_reverse(): > > KASSERT(sk != skrev); > > I expect the KASSERT will fire sooner or later. I'm not sure > which packet could trigger such condition (sk == skrev), I > suspect this could be kind of multicast/broadcast packet. > Hence I'd like to ask you to give a try KASSSERT() above. > panic: kernel diagnostic assertion "sk != skrev" failed: file "/usr/src/sys/net/pf.c", line 7372 Stopped at db_enter+0x10: popq%rbp TIDPIDUID PRFLAGS PFLAGS CPU COMMAND * 51803 77372 0 0x14000 0x2001K softnet db_enter() at db_enter+0x10 panic(81e05c77) at panic+0x12a __assert(81e706d5,81de2d6c,1ccc,81e1a122) at __assert+0x2b pf_state_key_link_reverse(fd81091c1000,fd81091c1000) at pf_state_key_link_reverse+0x8f pf_find_state(80002246e240,80002246e178,80002246e330) at pf_find_state+0x204 pf_test_state(80002246e240,80002246e330,80002246e33e,0) at pf_test_state+0x178 pf_test(2,2,800bb050,80002246e3c8) at pf_test+0xcec bridge_ip(80a53000,2,800bb050,80002246e4c0,fd80be0b6000) at bridge_ip+0x609 bridge_broadcast(80a53000,80099048,80002246e4c0,fd80cf52f300) at bridge_broadcast+0x1a2 bridgeintr_frame(80a53000,80099048,fd80cf52f300) at bridgeintr_frame+0x2fa bridge_process(80099048,fd80cf52f300) at bridge_process+0x2d6bridgeintr() at bridgeintr+0x84 if_netisr(0) at if_netisr+0xe9 taskq_thread(80033000) at taskq_thread+0x9f end trace frame: 0x0, count: 1 https://www.openbsd.org/ddb.html describes the minimum info required in bug reports. Insufficient info makes it difficult to find and fix bugs. ddb{1}> Thanks. -- Sebastien Marie
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
Hello Olivier and Sebastien, I took a look at old version of pf_state_key_link_reverse(), before my commit [1] changed it. The answer was there: 7368 void 7369 pf_state_key_link_reverse(struct pf_state_key *sk, struct pf_state_key *skrev) 7370 { 7371 /* Note that sk and skrev may be equal, then we refcount twice. */ 7372 KASSERT(sk != skrev); 7373 KASSERT(sk->reverse == NULL); 7374 KASSERT(skrev->reverse == NULL); 7375 sk->reverse = pf_state_key_ref(skrev); 7376 skrev->reverse = pf_state_key_ref(sk); 7377 } comment at line 7371 says the skrev and sk may be equal. Taking a look at current pf_state_key_link_reverse(), which is in the tree right now, behaves differently to old function when sk and skrev are identical: 7368 void 7369 pf_state_key_link_reverse(struct pf_state_key *sk, struct pf_state_key *skrev) 7370 { 7371 struct pf_state_key *old_reverse; 7372 7373 old_reverse = atomic_cas_ptr(>reverse, NULL, skrev); 7374 if (old_reverse != NULL) 7375 KASSERT(old_reverse == skrev); 7376 else 7377 pf_state_key_ref(skrev); 7378 7379 old_reverse = atomic_cas_ptr(>reverse, NULL, sk); 7380 if (old_reverse != NULL) 7381 KASSERT(old_reverse == sk); 7382 else 7383 pf_state_key_ref(sk); 7384 } note that if both keys are identical, we are going to grab just single reference. The story goes like that: we assume sk == skrev sk->reverse == NULL when we do atomic_cas() at line 7373 the cas sets sk->reverse to skrev, which is equal to sk the old_reverse is NULL when we reach 7374, therefore we grab reference to skrev at line 7379 we do cas_ptr on skrev->reverse. because sk == skrev, the skrev->reverse is not NULL. the old_reverse is not NULL when we reach 7380, old_reverse equals to skrev (and sk) old_reverse == skrev == sk in this case we do KASSERT(old_reverse == sk) at line 7381, the assertion holds so we just return from the function without getting extra reference. this is wrong. I'd like to ask you for yet another brave test. just to verify the story I dream of above really happens. The plan is to put yet another KASSERT() to pf_state_key_link_reverse(): KASSERT(sk != skrev); I expect the KASSERT will fire sooner or later. I'm not sure which packet could trigger such condition (sk == skrev), I suspect this could be kind of multicast/broadcast packet. Hence I'd like to ask you to give a try KASSSERT() above. diff below is against current tree. It adds desired assert to your pf_state_key_link_reverse(). thank you very much for your help. regards sashan [1] https://marc.info/?l=openbsd-cvs=161951631726837=2 8<---8<---8<--8< diff --git a/sys/net/pf.c b/sys/net/pf.c index 23eebf4a274..298568dec28 100644 --- a/sys/net/pf.c +++ b/sys/net/pf.c @@ -7368,19 +7368,12 @@ pf_inp_unlink(struct inpcb *inp) void pf_state_key_link_reverse(struct pf_state_key *sk, struct pf_state_key *skrev) { - struct pf_state_key *old_reverse; - - old_reverse = atomic_cas_ptr(>reverse, NULL, skrev); - if (old_reverse != NULL) - KASSERT(old_reverse == skrev); - else - pf_state_key_ref(skrev); - - old_reverse = atomic_cas_ptr(>reverse, NULL, sk); - if (old_reverse != NULL) - KASSERT(old_reverse == sk); - else - pf_state_key_ref(sk); + /* Note that sk and skrev may be equal, then we refcount twice. */ + KASSERT(sk != skrev); + KASSERT(sk->reverse == NULL); + KASSERT(skrev->reverse == NULL); + sk->reverse = pf_state_key_ref(skrev); + skrev->reverse = pf_state_key_ref(sk); } #if NPFLOG > 0
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
On Thu, May 06, 2021 at 04:50:31PM +0200, alexandr.nedvedi...@oracle.com wrote: > thank you for your help with this. I have not heard back from Sebastien > yet. > one more question: > are you building your bsd kernel with DIAGNOSTIC option enabled? It is the GENERIC kernel. So with DIAGNOSTIC: $ grep DIAG /sys/conf/GENERIC option DIAGNOSTIC # internal consistency checks $ Thanks, Best. -- Olivier Cherrier Phone: +352691570680 mailto:o...@symacx.com
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
On Thu, May 06, 2021 at 06:10:39PM +0200, Alexandr Nedvedicky wrote: > Hello, > > > > > to be honest I have no idea what could be causing problems on those two > > > fairly > > > distinct machines. The strange thing is that pf_test() currently does not > > > run in > > > parallel. I don't quite understand why reverting my earlier change helps > > > here. > > > > it could be two differents ways to trigger a bug somewhere else that > > your commit expose. > > > > the panic doesn't trigger in the same way on both machines: > > - Olivier's machine seems to trigger it quickly (after some minutes) > > - mine relatively slowly (~ once a day) > > Olivier's machine acts as AP, so it forwards packets between interfaces. > > If I remember correctly your machine is laptop/workstation, which > does not forward traffic. it doesn't forward, but it acts as a bridge: 2 physical networks cards grouped in a bridge(4) with only few traffic (a network printer is on other side, the bridge(4) is here because I had a sparse network card and no physical-switch to put here). > the function, which we change back and forth here is > pf_state_key_link_reverse(), which is being called from pf_find_state() > here: > > 1085 > 1086 if (sk == NULL) { > 1087 if ((sk = RB_FIND(pf_state_tree, _statetbl, > 1088 (struct pf_state_key *)key)) == NULL) > 1089 return (PF_DROP); > 1090 if (pd->dir == PF_OUT && pkt_sk && > 1091 pf_compare_state_keys(pkt_sk, sk, pd->kif, pd->dir) > == 0) > 1092 pf_state_key_link_reverse(sk, pkt_sk); > 1093 else if (pd->dir == PF_OUT && pd->m->m_pkthdr.pf.inp && > 1094 !pd->m->m_pkthdr.pf.inp->inp_pf_sk && !sk->inp) > 1095 pf_state_key_link_inpcb(sk, > pd->m->m_pkthdr.pf.inp); > 1096 } > 1097 > > the story in human words goes as follows: > > sk == NULL -> no matching state key was attached to packet, Thus we > have to search state key in state tree using RB_FIND() > > if we could find state key for packet in table, then we will try > to set up a 'shortcut', which can save us RB_FIND() later. > > 1090 - 1092 > the shortcut can be set up for outbound packet only (pd->dir PF_OUT), > which is also being forwarded (pkt_sk != NULL, indicates we are seeing > the packet for the second time pkt_sk holds state key for inbound > direction). pf_compare_state_keys() is sanity check, it leaves a > debug message on system console on failure. > > So if it is outbound forwarded packet, we've seen earlier, we > set up a reverse link to save one RB_FIND() operation on next > forwarded packet, which matches the same state. > > 1093 - 1095 > creates similar shortcut for local bound packets. We put reference > to state key into PCB linked to socket. This will save us RB_FIND() > operation for next local outbound, which matches the same state. > > > given the bug seems to be triggered/uncovered by pf_state_key_link_reverse() > is there any chance your laptop/workstation occasionally forwards packets? > like doing NAT for vmd/qemu virtual machine? I don't know bridge(4) internals, but it could make sense that it is using such functions. > if it is not the case then the question is how does it come we run > pf_state_key_link_reverse()? which same as why pkt_sk is not NULL at line > 1090. > > > > > I could try to run with your commit and see if I could trigger it more > > easily or found some elements influencing it. I could try with GENERIC > > for example to see if I still trigger the same assert() or if it is > > more like Olivier. > > I need to think of how to further debug the thing. > > > > > my LAN was several hosts with the same kernel and only this machine > > trigger the panic, so it shouldn't be strictly linked to the > > environment. > > Thanks -- Sebastien Marie
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
Hello, > > to be honest I have no idea what could be causing problems on those two > > fairly > > distinct machines. The strange thing is that pf_test() currently does not > > run in > > parallel. I don't quite understand why reverting my earlier change helps > > here. > > it could be two differents ways to trigger a bug somewhere else that > your commit expose. > > the panic doesn't trigger in the same way on both machines: > - Olivier's machine seems to trigger it quickly (after some minutes) > - mine relatively slowly (~ once a day) Olivier's machine acts as AP, so it forwards packets between interfaces. If I remember correctly your machine is laptop/workstation, which does not forward traffic. the function, which we change back and forth here is pf_state_key_link_reverse(), which is being called from pf_find_state() here: 1085 1086 if (sk == NULL) { 1087 if ((sk = RB_FIND(pf_state_tree, _statetbl, 1088 (struct pf_state_key *)key)) == NULL) 1089 return (PF_DROP); 1090 if (pd->dir == PF_OUT && pkt_sk && 1091 pf_compare_state_keys(pkt_sk, sk, pd->kif, pd->dir) == 0) 1092 pf_state_key_link_reverse(sk, pkt_sk); 1093 else if (pd->dir == PF_OUT && pd->m->m_pkthdr.pf.inp && 1094 !pd->m->m_pkthdr.pf.inp->inp_pf_sk && !sk->inp) 1095 pf_state_key_link_inpcb(sk, pd->m->m_pkthdr.pf.inp); 1096 } 1097 the story in human words goes as follows: sk == NULL -> no matching state key was attached to packet, Thus we have to search state key in state tree using RB_FIND() if we could find state key for packet in table, then we will try to set up a 'shortcut', which can save us RB_FIND() later. 1090 - 1092 the shortcut can be set up for outbound packet only (pd->dir PF_OUT), which is also being forwarded (pkt_sk != NULL, indicates we are seeing the packet for the second time pkt_sk holds state key for inbound direction). pf_compare_state_keys() is sanity check, it leaves a debug message on system console on failure. So if it is outbound forwarded packet, we've seen earlier, we set up a reverse link to save one RB_FIND() operation on next forwarded packet, which matches the same state. 1093 - 1095 creates similar shortcut for local bound packets. We put reference to state key into PCB linked to socket. This will save us RB_FIND() operation for next local outbound, which matches the same state. given the bug seems to be triggered/uncovered by pf_state_key_link_reverse() is there any chance your laptop/workstation occasionally forwards packets? like doing NAT for vmd/qemu virtual machine? if it is not the case then the question is how does it come we run pf_state_key_link_reverse()? which same as why pkt_sk is not NULL at line 1090. > > I could try to run with your commit and see if I could trigger it more > easily or found some elements influencing it. I could try with GENERIC > for example to see if I still trigger the same assert() or if it is > more like Olivier. I need to think of how to further debug the thing. > > my LAN was several hosts with the same kernel and only this machine > trigger the panic, so it shouldn't be strictly linked to the > environment. > thanks a lot for your help (and patience) regards sashan
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
On Thu, May 06, 2021 at 04:50:31PM +0200, Alexandr Nedvedicky wrote: > Hello Olivier, > > > > This morning, I've rebuild a new --current kernel and got some panics after > > some minutes with PF enabled. > > Then I've applied your patch and it is stable so far. > > > > thank you for your help with this. I have not heard back from Sebastien > yet. my machine is stable with the commit reverted: up 1 day, 8:20. > one more question: > are you building your bsd kernel with DIAGNOSTIC option enabled? > I think you don't, because your crash matches uvm fault due to > use-after-free. > > Sebastien hit the problem earlier by KASSERT(). > > just to summarize there are two boxes so far, which choked up with my commit > [1]. > both boxes are quite different. yours runs bsd kernel on single core CPU: > > cpu0: Geode(TM) Integrated Processor by AMD PCS ("AuthenticAMD" 586-class) > 499 MHz, \ > 05-0a-02 > > Sebastien runs bsd.mp on two CPU cores: > cpu0: Intel(R) Core(TM)2 Duo CPU E6750 @ 2.66GHz, 2660.30 MHz, 06-0f-0b > > I'm not able to trigger crash on my HW. Which is notebbok running bsd.mp on: > > cpu0: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz, 1496.74 MHz, 06-45-01 > > the other box, is APU router running bsd.mp on 4 cores: > cpu0: AMD GX-412TC SOC, 998.26 MHz, 16-30-01 > > > to be honest I have no idea what could be causing problems on those two fairly > distinct machines. The strange thing is that pf_test() currently does not run > in > parallel. I don't quite understand why reverting my earlier change helps here. it could be two differents ways to trigger a bug somewhere else that your commit expose. the panic doesn't trigger in the same way on both machines: - Olivier's machine seems to trigger it quickly (after some minutes) - mine relatively slowly (~ once a day) I could try to run with your commit and see if I could trigger it more easily or found some elements influencing it. I could try with GENERIC for example to see if I still trigger the same assert() or if it is more like Olivier. my LAN was several hosts with the same kernel and only this machine trigger the panic, so it shouldn't be strictly linked to the environment. Thanks. -- Sebastien Marie
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
Hello Olivier, > > in your case we've missed the assert and are dying on uvm fault. > > > > you both seem to be using rdr-to. your pf seems to use also divert-to rule. > > I suspect something is going wrong when we deal with traffic, which matches > > rdr-to rule. > > > > > > would you be so kind and try diff below on your AP box. The diff removes > > my change to pf_state_key_link_reverse(). Which is a primary suspect > > at the moment. > > > > I'm not able to trigger the panic on my notebook, nor on my > > home router. > > This morning, I've rebuild a new --current kernel and got some panics after > some minutes with PF enabled. > Then I've applied your patch and it is stable so far. > thank you for your help with this. I have not heard back from Sebastien yet. one more question: are you building your bsd kernel with DIAGNOSTIC option enabled? I think you don't, because your crash matches uvm fault due to use-after-free. Sebastien hit the problem earlier by KASSERT(). just to summarize there are two boxes so far, which choked up with my commit [1]. both boxes are quite different. yours runs bsd kernel on single core CPU: cpu0: Geode(TM) Integrated Processor by AMD PCS ("AuthenticAMD" 586-class) 499 MHz, \ 05-0a-02 Sebastien runs bsd.mp on two CPU cores: cpu0: Intel(R) Core(TM)2 Duo CPU E6750 @ 2.66GHz, 2660.30 MHz, 06-0f-0b I'm not able to trigger crash on my HW. Which is notebbok running bsd.mp on: cpu0: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz, 1496.74 MHz, 06-45-01 the other box, is APU router running bsd.mp on 4 cores: cpu0: AMD GX-412TC SOC, 998.26 MHz, 16-30-01 to be honest I have no idea what could be causing problems on those two fairly distinct machines. The strange thing is that pf_test() currently does not run in parallel. I don't quite understand why reverting my earlier change helps here. sorry for inconveniences regards sashan [1] https://marc.info/?l=openbsd-cvs=161951631726837=2
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
Hello Alexandr, On Wed, May 05, 2021 at 05:53:25PM +0200, alexandr.nedvedi...@oracle.com wrote: > I've seen your report here > > https://marc.info/?l=openbsd-bugs=161968896108810 > > your crash is slightly different. Sebastien is lucky enough > to trip crash in assert, when state key is dereferenced. OK. > in your case we've missed the assert and are dying on uvm fault. > > you both seem to be using rdr-to. your pf seems to use also divert-to rule. > I suspect something is going wrong when we deal with traffic, which matches > rdr-to rule. > > > would you be so kind and try diff below on your AP box. The diff removes > my change to pf_state_key_link_reverse(). Which is a primary suspect > at the moment. > > I'm not able to trigger the panic on my notebook, nor on my > home router. This morning, I've rebuild a new --current kernel and got some panics after some minutes with PF enabled. Then I've applied your patch and it is stable so far. Thanks, Best. -- Olivier Cherrier Phone: +352691570680 mailto:o...@symacx.com
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
Hello Olivier, I've seen your report here https://marc.info/?l=openbsd-bugs=161968896108810 your crash is slightly different. Sebastien is lucky enough to trip crash in assert, when state key is dereferenced. in your case we've missed the assert and are dying on uvm fault. you both seem to be using rdr-to. your pf seems to use also divert-to rule. I suspect something is going wrong when we deal with traffic, which matches rdr-to rule. would you be so kind and try diff below on your AP box. The diff removes my change to pf_state_key_link_reverse(). Which is a primary suspect at the moment. I'm not able to trigger the panic on my notebook, nor on my home router. thank you for your help regards sashan 8<---8<---8<--8< diff --git a/sys/net/pf.c b/sys/net/pf.c index 23eebf4a274..12d05976f0b 100644 --- a/sys/net/pf.c +++ b/sys/net/pf.c @@ -7368,19 +7368,11 @@ pf_inp_unlink(struct inpcb *inp) void pf_state_key_link_reverse(struct pf_state_key *sk, struct pf_state_key *skrev) { - struct pf_state_key *old_reverse; - - old_reverse = atomic_cas_ptr(>reverse, NULL, skrev); - if (old_reverse != NULL) - KASSERT(old_reverse == skrev); - else - pf_state_key_ref(skrev); - - old_reverse = atomic_cas_ptr(>reverse, NULL, sk); - if (old_reverse != NULL) - KASSERT(old_reverse == sk); - else - pf_state_key_ref(sk); + /* Note that sk and skrev may be equal, then we refcount twice. */ + KASSERT(sk->reverse == NULL); + KASSERT(skrev->reverse == NULL); + sk->reverse = pf_state_key_ref(skrev); + skrev->reverse = pf_state_key_ref(sk); } #if NPFLOG > 0
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
On Tue, May 04, 2021 at 12:26:17PM +0200, sema...@online.fr wrote: > Date: Tue, 4 May 2021 12:26:17 +0200 > From: Sebastien Marie > To: Alexandr Nedvedicky > Cc: bugs@openbsd.org > Subject: Re: [External] : pf_state_key_unref: panic: kernel diagnostic > assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", > line 826 > > On Tue, May 04, 2021 at 11:47:55AM +0200, Alexandr Nedvedicky wrote: > > Hello Sebastien, > > > > On Tue, May 04, 2021 at 11:08:19AM +0200, Sebastien Marie wrote: > > > Hi, > > > > > > Currently, I am regulary (~1 per day) get panic on an amd64 host (OpenBSD > > > 6.9-current (GENERIC.MP) #492: Sat May 1 17:37:28 MDT 2021). > > Previous working kernel was OpenBSD 6.9-current (GENERIC.MP) #477: Sat Apr 24 > 16:08:13 MDT 2021 It looks to be similar than this one: https://marc.info/?l=openbsd-bugs=161968896108810 -- Olivier Cherrier Phone: +352691570680 mailto:o...@symacx.com
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
On Tue, May 04, 2021 at 04:50:06PM +0200, Sebastien Marie wrote: > On Tue, May 04, 2021 at 02:15:05PM +0200, Alexandr Nedvedicky wrote: > > Hello Sebastien, > > > > thank you for additional info about previously working kernel. > > > > it looks like your older kernel, which works, might be running without > > my commit > > > > revision 1.1116 > > date: 2021/04/27 09:38:29; author: sashan; state: Exp;\ > > lines: +14 -6; commitid: 3W1fRTkLb3ZlUanF; > > pf_state_key_link_reverse() is prone to race on parallel forwarding > > > > we need to adjust assertions. at time we call > > pf_state_key_link_reverse() > > is state_key either linked to correct reverse peer or not linked at all. > > The pf_state_key_link_reverse() is being called as a reader ons > > tate_lock. > > There might be more packets, which try to update the state key. > > > > OK bluhm@ > > > > > # cat /etc/pf.conf > > > # See pf.conf(5) and /etc/examples/pf.conf > > > # > > > # avoid skip on lo to get tryton redir > > > #set skip on lo > > > > > > block return# block stateless traffic > > > pass# establish keep-state > > > > > > # redir 80 -> 8000 > > > pass in proto tcp to (self) port 80 rdr-to > > > 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000 > > > > > > > the rule above is also interesting bit. this may trigger NAT-64. > > I'm not using IPv6 on my hosts/routers. I'll try to use similar rule > > to try to reproduce the crash. > > I will first try with another rule: > > pass in inet6 proto tcp to (self) port 80 rdr-to > 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000 > > to avoid the possible NAT-64. the assert still trigger. > And next (if I still trigger the assert) I will rebuild a kernel and > backout this specific commit. trying with a kernel without this commit -- Sebastien Marie
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
On Tue, May 04, 2021 at 05:01:35PM +0200, Florian Obser wrote: > On 2021-05-04 16:50 +02, Sebastien Marie wrote: > > On Tue, May 04, 2021 at 02:15:05PM +0200, Alexandr Nedvedicky wrote: > >> Hello Sebastien, > >> > >> thank you for additional info about previously working kernel. > >> > >> it looks like your older kernel, which works, might be running without > >> my commit > >> > >> revision 1.1116 > >> date: 2021/04/27 09:38:29; author: sashan; state: Exp;\ > >> lines: +14 -6; commitid: 3W1fRTkLb3ZlUanF; > >> pf_state_key_link_reverse() is prone to race on parallel forwarding > >> > >> we need to adjust assertions. at time we call > >> pf_state_key_link_reverse() > >> is state_key either linked to correct reverse peer or not linked at > >> all. > >> The pf_state_key_link_reverse() is being called as a reader ons > >> tate_lock. > >> There might be more packets, which try to update the state key. > >> > >> OK bluhm@ > >> > >> > # cat /etc/pf.conf > >> > # See pf.conf(5) and /etc/examples/pf.conf > >> > # > >> > # avoid skip on lo to get tryton redir > >> > #set skip on lo > >> > > >> > block return# block stateless traffic > >> > pass# establish keep-state > >> > > >> > # redir 80 -> 8000 > >> > pass in proto tcp to (self) port 80 rdr-to > >> > 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000 > >> > > >> > >> the rule above is also interesting bit. this may trigger NAT-64. > >> I'm not using IPv6 on my hosts/routers. I'll try to use similar rule > >> to try to reproduce the crash. > > > > I will first try with another rule: > > > > pass in inet6 proto tcp to (self) port 80 rdr-to > > 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000 > > > > to avoid the possible NAT-64. > > fwiw I intentionally run NAT64 and I don't see this. > But I only have one client, my android phone. > > kern.version=OpenBSD 6.9-current (GENERIC.MP) #559: Fri Apr 30 03:31:48 MDT > 2021 > > > set skip on lo > block return # block stateless traffic > pass # establish keep-state > pass in log on vlan64 inet6 from any to 64:ff9b::/96 af-to inet \ > from 192.168.178.23 > > indeed, NAT and NAT-64 are essentially same from state table point of view. it looks like the issue is rather triggered by local-bound traffic, which matches rdr-to rule. We might be missing a reference when linking a state key to pcb bound to local socket. This is where I'm currently looking at. thanks and regards sashan
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
On 2021-05-04 16:50 +02, Sebastien Marie wrote: > On Tue, May 04, 2021 at 02:15:05PM +0200, Alexandr Nedvedicky wrote: >> Hello Sebastien, >> >> thank you for additional info about previously working kernel. >> >> it looks like your older kernel, which works, might be running without >> my commit >> >> revision 1.1116 >> date: 2021/04/27 09:38:29; author: sashan; state: Exp;\ >> lines: +14 -6; commitid: 3W1fRTkLb3ZlUanF; >> pf_state_key_link_reverse() is prone to race on parallel forwarding >> >> we need to adjust assertions. at time we call pf_state_key_link_reverse() >> is state_key either linked to correct reverse peer or not linked at all. >> The pf_state_key_link_reverse() is being called as a reader ons >> tate_lock. >> There might be more packets, which try to update the state key. >> >> OK bluhm@ >> >> > # cat /etc/pf.conf >> > # See pf.conf(5) and /etc/examples/pf.conf >> > # >> > # avoid skip on lo to get tryton redir >> > #set skip on lo >> > >> > block return# block stateless traffic >> > pass# establish keep-state >> > >> > # redir 80 -> 8000 >> > pass in proto tcp to (self) port 80 rdr-to >> > 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000 >> > >> >> the rule above is also interesting bit. this may trigger NAT-64. >> I'm not using IPv6 on my hosts/routers. I'll try to use similar rule >> to try to reproduce the crash. > > I will first try with another rule: > > pass in inet6 proto tcp to (self) port 80 rdr-to > 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000 > > to avoid the possible NAT-64. fwiw I intentionally run NAT64 and I don't see this. But I only have one client, my android phone. kern.version=OpenBSD 6.9-current (GENERIC.MP) #559: Fri Apr 30 03:31:48 MDT 2021 set skip on lo block return# block stateless traffic pass# establish keep-state pass in log on vlan64 inet6 from any to 64:ff9b::/96 af-to inet \ from 192.168.178.23 > > And next (if I still trigger the assert) I will rebuild a kernel and > backout this specific commit. > > It should permit to see if the commit is the problem or not. > > Thanks. -- I'm not entirely sure you are real.
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
On Tue, May 04, 2021 at 02:15:05PM +0200, Alexandr Nedvedicky wrote: > Hello Sebastien, > > thank you for additional info about previously working kernel. > > it looks like your older kernel, which works, might be running without > my commit > > revision 1.1116 > date: 2021/04/27 09:38:29; author: sashan; state: Exp;\ >lines: +14 -6; commitid: 3W1fRTkLb3ZlUanF; > pf_state_key_link_reverse() is prone to race on parallel forwarding > > we need to adjust assertions. at time we call pf_state_key_link_reverse() > is state_key either linked to correct reverse peer or not linked at all. > The pf_state_key_link_reverse() is being called as a reader ons tate_lock. > There might be more packets, which try to update the state key. > > OK bluhm@ > > > # cat /etc/pf.conf > > # See pf.conf(5) and /etc/examples/pf.conf > > # > > # avoid skip on lo to get tryton redir > > #set skip on lo > > > > block return# block stateless traffic > > pass# establish keep-state > > > > # redir 80 -> 8000 > > pass in proto tcp to (self) port 80 rdr-to > > 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000 > > > > the rule above is also interesting bit. this may trigger NAT-64. > I'm not using IPv6 on my hosts/routers. I'll try to use similar rule > to try to reproduce the crash. I will first try with another rule: pass in inet6 proto tcp to (self) port 80 rdr-to 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000 to avoid the possible NAT-64. And next (if I still trigger the assert) I will rebuild a kernel and backout this specific commit. It should permit to see if the commit is the problem or not. Thanks. -- Sebastien Marie
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
Hello Sebastien, thank you for additional info about previously working kernel. it looks like your older kernel, which works, might be running without my commit revision 1.1116 date: 2021/04/27 09:38:29; author: sashan; state: Exp;\ lines: +14 -6; commitid: 3W1fRTkLb3ZlUanF; pf_state_key_link_reverse() is prone to race on parallel forwarding we need to adjust assertions. at time we call pf_state_key_link_reverse() is state_key either linked to correct reverse peer or not linked at all. The pf_state_key_link_reverse() is being called as a reader ons tate_lock. There might be more packets, which try to update the state key. OK bluhm@ > # cat /etc/pf.conf > # See pf.conf(5) and /etc/examples/pf.conf > # > # avoid skip on lo to get tryton redir > #set skip on lo > > block return# block stateless traffic > pass# establish keep-state > > # redir 80 -> 8000 > pass in proto tcp to (self) port 80 rdr-to > 2001:41d0:fe39:c05c:56b8:d15b:2e0a:8775 port 8000 > the rule above is also interesting bit. this may trigger NAT-64. I'm not using IPv6 on my hosts/routers. I'll try to use similar rule to try to reproduce the crash. thanks and regards sashan
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
On Tue, May 04, 2021 at 11:47:55AM +0200, Alexandr Nedvedicky wrote: > Hello Sebastien, > > On Tue, May 04, 2021 at 11:08:19AM +0200, Sebastien Marie wrote: > > Hi, > > > > Currently, I am regulary (~1 per day) get panic on an amd64 host (OpenBSD > > 6.9-current (GENERIC.MP) #492: Sat May 1 17:37:28 MDT 2021). Previous working kernel was OpenBSD 6.9-current (GENERIC.MP) #477: Sat Apr 24 16:08:13 MDT 2021 > > panic: kernel diagnostic assertion "refcnt != ~0" failed: file > > "/usr/src/sys/kern/kern_synch.c", line 826 > > Starting stack trace... > > panic(81dfbc8e) at panic+0x11d > > __assert(81e64b54,81e0a6ee,33a,81e03b7f) at > > __assert+0x2b > > refcnt_rele(fd810bf02458) at refcnt_rele+0x6f > > pf_state_key_unref(fd810bf023f0) at pf_state_key_unref+0x21 > > pf_remove_state(fd810c0c4578) at pf_remove_state+0x1fa > > pf_purge_expired_states(2) at pf_purge_expired_states+0x232 > > pf_purge(82236a30) at pf_purge+0x33 > > taskq_thread(80032080) at taskq_thread+0x81 > > end trace frame: 0x0, count: 249 > > End of stack trace. > > > > The machine is running with ddb.panic=0, and hangs before rebooting (I > > assume it hangs during dump). I just changed the config to keep in > > ddb, and I should be able to get additionnal informations next time. > > > > > > As it is panicing in pf related code, some informations on the > > network: the machine has two networks cards (sis0 and em0) bridged, > > and a vlan (in rdomain). > > > > em0 is egress, and sis0 has only a network printer behind it. > > > > It panics whereas there is no particular interaction with the host, > > but time of panic is on state removal so I assume after a timeout. > > > > > The assert is being tripped out of sudden? or are you also modifying state > table (pfctl -Fs). or something else is happening forced reboot/power > off... nothing that I know. The assert seems be raised suddenly. But according to the stacktrace, it occured due to call to pf_purge_expired_states(), it is why I mentioned timeout. thanks. -- Sebastien Marie
Re: [External] : pf_state_key_unref: panic: kernel diagnostic assertion "refcnt != ~0" failed: file "/usr/src/sys/kern/kern_synch.c", line 826
Hello Sebastien, On Tue, May 04, 2021 at 11:08:19AM +0200, Sebastien Marie wrote: > Hi, > > Currently, I am regulary (~1 per day) get panic on an amd64 host (OpenBSD > 6.9-current (GENERIC.MP) #492: Sat May 1 17:37:28 MDT 2021). > > panic: kernel diagnostic assertion "refcnt != ~0" failed: file > "/usr/src/sys/kern/kern_synch.c", line 826 > Starting stack trace... > panic(81dfbc8e) at panic+0x11d > __assert(81e64b54,81e0a6ee,33a,81e03b7f) at > __assert+0x2b > refcnt_rele(fd810bf02458) at refcnt_rele+0x6f > pf_state_key_unref(fd810bf023f0) at pf_state_key_unref+0x21 > pf_remove_state(fd810c0c4578) at pf_remove_state+0x1fa > pf_purge_expired_states(2) at pf_purge_expired_states+0x232 > pf_purge(82236a30) at pf_purge+0x33 > taskq_thread(80032080) at taskq_thread+0x81 > end trace frame: 0x0, count: 249 > End of stack trace. > > The machine is running with ddb.panic=0, and hangs before rebooting (I > assume it hangs during dump). I just changed the config to keep in > ddb, and I should be able to get additionnal informations next time. > > > As it is panicing in pf related code, some informations on the > network: the machine has two networks cards (sis0 and em0) bridged, > and a vlan (in rdomain). > > em0 is egress, and sis0 has only a network printer behind it. > > It panics whereas there is no particular interaction with the host, > but time of panic is on state removal so I assume after a timeout. > The assert is being tripped out of sudden? or are you also modifying state table (pfctl -Fs). or something else is happening forced reboot/power off... thanks and regards sashan