Re: assertion "_kernel_lock_held()" failed, uipc_socket2.c: ipsec

2017-11-13 Thread Martin Pieuchot
On 13/11/17(Mon) 12:33, Stuart Henderson wrote:
> On 2017/11/13 13:17, Martin Pieuchot wrote:
> > [...]
> > So it seems that two of your CPUs end up looking at/dealing with
> > corrupted memory...
> 
> Is that for sure? 2 does normally print a trace, 3 also drops into ddb.

But none of them print:

panic: spl assertion failure in soassertlocked.

However, it might just be a race: the other CPU may have just entered
panic and set splassert_ctl to 0.
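
To make the suspected race concrete, here is a minimal userspace sketch.
It is not the kernel source: every model_* name and the outcome enum are
hypothetical, and the mapping of values to actions (1 message, 2 trace,
3 ddb, anything else panic) is only my reading of what is said in this
thread. The point is that the "is checking enabled" test and the later
"what do I do about it" decision read the control value at two different
moments.

```c
#include <assert.h>

/* Hypothetical outcome labels for this model; not kernel identifiers. */
enum outcome { OUT_SKIPPED, OUT_MESSAGE, OUT_TRACE, OUT_DDB, OUT_PANIC };

/* Models the kern.splassert control value shared by all CPUs. */
static volatile int splassert_ctl = 2;

/* Models the failure handler: the action depends on the value read now. */
static enum outcome
model_splassert_fail(void)
{
	switch (splassert_ctl) {
	case 1:
		return OUT_MESSAGE;	/* message only */
	case 2:
		return OUT_TRACE;	/* message and stack trace */
	case 3:
		return OUT_DDB;		/* message, trace, enter ddb */
	default:
		return OUT_PANIC;	/* 0 or > 3 */
	}
}

/* Models another CPU entering panic and zeroing the control value. */
static void
model_other_cpu_panics(void)
{
	splassert_ctl = 0;
}

/*
 * Models one failed assertion on this CPU. If other_cpu_races is
 * nonzero, the other CPU panics after the "enabled" check has passed
 * but before the handler looks at the control value again.
 */
static enum outcome
model_failed_assertion(int other_cpu_races)
{
	if (splassert_ctl <= 0)
		return OUT_SKIPPED;	/* checking disabled */
	if (other_cpu_races)
		model_other_cpu_panics();
	return model_splassert_fail();
}
```

With the value at 2, model_failed_assertion(0) gives OUT_TRACE, while
model_failed_assertion(1) gives OUT_PANIC even though nobody ever set
the value to 0 by hand. That would be consistent with seeing "panic: spl
assertion failure" on a box configured with kern.splassert=2.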

> Same after an hour or two uptime, but this time I get some "netlock:
> lock not held" from some cpu or other, and some functions in the bits of
> the trace that get displayed:
> 
> login: panic: kernel diagnostic assertion "_kernel_lock_held()" failed: file 
> "/src/cvs-openbsd/sys/kern/uipc_socket2.c", line 310
> Starting stack trace...
> panic() at panic+0x11b
> __assert(812105d4,80001f898a70,ff0063dc5b00,ff0061804318) 
> at __assert+0x24
> sbappendaddr(0,ff0061804318,ff005fca5600,0,ff0063dc5b00) at 
> sbappendaddrpanic: netlock: lock not held

Does the diff below help?  In any case it should reduce the "netlock:
lock not held" noise.

Index: net/pfkeyv2.c
===
RCS file: /cvs/src/sys/net/pfkeyv2.c,v
retrieving revision 1.173
diff -u -p -r1.173 pfkeyv2.c
--- net/pfkeyv2.c   12 Nov 2017 14:11:15 -  1.173
+++ net/pfkeyv2.c   13 Nov 2017 12:57:36 -
@@ -428,12 +428,14 @@ pfkeyv2_sendmessage(void **headers, int 
 * Search for promiscuous listeners, skipping the
 * original destination.
 */
+   KERNEL_LOCK();
LIST_FOREACH(s, _sockets, kcb_list) {
if ((s->flags & PFKEYV2_SOCKETFLAGS_PROMISC) &&
(s->rcb.rcb_socket != so) &&
(s->rdomain == rdomain))
pfkey_sendup(s, packet, 1);
}
+   KERNEL_UNLOCK();
m_freem(packet);
break;
 
@@ -442,6 +444,7 @@ pfkeyv2_sendmessage(void **headers, int 
 * Send the message to all registered sockets that match
 * the specified satype (e.g., all IPSEC-ESP negotiators)
 */
+   KERNEL_LOCK();
LIST_FOREACH(s, _sockets, kcb_list) {
if ((s->flags & PFKEYV2_SOCKETFLAGS_REGISTERED) &&
(s->rdomain == rdomain)) {
@@ -454,6 +457,7 @@ pfkeyv2_sendmessage(void **headers, int 
}
}
}
+   KERNEL_UNLOCK();
/* Free last/original copy of the packet */
m_freem(packet);
 
@@ -472,21 +476,25 @@ pfkeyv2_sendmessage(void **headers, int 
goto ret;
 
/* Send to all registered promiscuous listeners */
+   KERNEL_LOCK();
LIST_FOREACH(s, _sockets, kcb_list) {
if ((s->flags & PFKEYV2_SOCKETFLAGS_PROMISC) &&
!(s->flags & PFKEYV2_SOCKETFLAGS_REGISTERED) &&
(s->rdomain == rdomain))
pfkey_sendup(s, packet, 1);
}
+   KERNEL_UNLOCK();
m_freem(packet);
break;
 
case PFKEYV2_SENDMESSAGE_BROADCAST:
/* Send message to all sockets */
+   KERNEL_LOCK();
LIST_FOREACH(s, _sockets, kcb_list) {
if (s->rdomain == rdomain)
pfkey_sendup(s, packet, 1);
}
+   KERNEL_UNLOCK();
m_freem(packet);
break;
}
@@ -1010,11 +1018,13 @@ pfkeyv2_send(struct socket *so, void *me
goto ret;
 
/* Send to all promiscuous listeners */
+   KERNEL_LOCK();
LIST_FOREACH(bkp, _sockets, kcb_list) {
if ((bkp->flags & PFKEYV2_SOCKETFLAGS_PROMISC) &&
(bkp->rdomain == rdomain))
pfkey_sendup(bkp, packet, 1);
}
+   KERNEL_UNLOCK();
 
m_freem(packet);
 
@@ -1788,12 +1798,15 @@ pfkeyv2_send(struct socket *so, void *me
if ((rval = pfdatatopacket(message, len, )) != 0)
goto ret;
 
-   LIST_FOREACH(bkp, _sockets, kcb_list)
+   KERNEL_LOCK();
+   LIST_FOREACH(bkp, _sockets, kcb_list) {
if ((bkp != kp) &&
(bkp->rdomain == rdomain) &&
(!smsg->sadb_msg_seq ||
(smsg->sadb_msg_seq == kp->pid)))
pfkey_sendup(bkp, packet, 1);
+

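The pattern the diff above applies at each call site can be sketched in
userspace as follows. This is only an illustration under stated
assumptions: the model_* functions, the depth counter standing in for
the kernel lock, and the listener struct are all hypothetical and are
not the pfkeyv2 code. It shows a delivery routine that refuses to run
without the lock, and a walker that takes the lock once around the
whole list traversal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel lock: a recursion depth counter. */
static int model_klock_depth;

static void
model_kernel_lock(void)
{
	model_klock_depth++;
}

static void
model_kernel_unlock(void)
{
	assert(model_klock_depth > 0);
	model_klock_depth--;
}

/* Hypothetical listener record; only the fields this sketch needs. */
struct model_listener {
	int promisc;	/* nonzero if the listener is promiscuous */
	int rdomain;	/* routing domain of the listener */
	int delivered;	/* number of messages delivered so far */
};

/*
 * Models the delivery path that asserts the lock. Returns -1 where the
 * real code would fail its "lock held" assertion, 0 on success.
 */
static int
model_sendup(struct model_listener *l)
{
	if (model_klock_depth == 0)
		return -1;
	l->delivered++;
	return 0;
}

/*
 * Walks all listeners and delivers to the promiscuous ones in the given
 * rdomain. When take_lock is nonzero the lock is held across the whole
 * loop. Returns the number of deliveries that hit the assertion.
 */
static int
model_broadcast(struct model_listener *v, size_t n, int rdomain,
    int take_lock)
{
	int failures = 0;
	size_t i;

	if (take_lock)
		model_kernel_lock();
	for (i = 0; i < n; i++) {
		if (v[i].promisc && v[i].rdomain == rdomain &&
		    model_sendup(&v[i]) != 0)
			failures++;
	}
	if (take_lock)
		model_kernel_unlock();
	return failures;
}
```

Calling model_broadcast() with take_lock set to 0 reports one failure
per matching listener, which is the situation the assertion in the
trace complains about; with take_lock set to 1 every matching listener
is reached and the lock is released again afterwards.
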
Re: assertion "_kernel_lock_held()" failed, uipc_socket2.c: ipsec

2017-11-13 Thread Sebastien Marie
On Mon, Nov 13, 2017 at 12:33:35PM +, Stuart Henderson wrote:
> 
> Same after an hour or two uptime, but this time I get some "netlock:
> lock not held" from some cpu or other, and some functions in the bits of
> the trace that get displayed:
> 
> login: panic: kernel diagnostic assertion "_kernel_lock_held()" failed: file 
> "/src/cvs-openbsd/sys/kern/uipc_socket2.c", line 310

Just a simple question regarding the previous line.

Is the start of the line ("login: ") part of the kernel output, or is
it just the login(1) prompt on the console (printed long before the
panic) that you copied along with the rest of the line?

thanks.
-- 
Sebastien Marie



Re: assertion "_kernel_lock_held()" failed, uipc_socket2.c: ipsec

2017-11-13 Thread Stuart Henderson
On 2017/11/13 13:17, Martin Pieuchot wrote:
> On 13/11/17(Mon) 10:03, Stuart Henderson wrote:
> > On 2017/11/13 08:44, Martin Pieuchot wrote:
> > > On 12/11/17(Sun) 22:10, Stuart Henderson wrote:
> > > > On 2017/11/12 22:48, Martin Pieuchot wrote:
> > > > > On 12/11/17(Sun) 21:30, Stuart Henderson wrote:
> > > > > > iked box, GENERIC.MP + WITNESS, -current as of Friday 10th:
> > > > > 
> > > > > Weird, did you tweak "kern.splassert" on this box?   Otherwise it
> > > > > looks like a major corruption.
> > > > 
> > > > It would have kern.splassert=2. (I know this can cause problems
> > > > sometimes, though this would be the first time in 5+ years I've bumped
> > > > into it, most of my routers where I have serial console have this set).
> > > 
> > > Well the panic below corresponds to a value of 0 or > 3.
> > 
> > Confirmed, it was definitely set to 2.
> 
> So it seems that two of your CPUs end up looking at/dealing with
> corrupted memory...

Is that for sure? 2 does normally print a trace, 3 also drops into ddb.

> > > > I'm trying to get more information because it had either hanged or
> > > > panicked previously (it didn't have serial connected at the time and
> > > > the machine was needed so it had to be rebooted before I had a chance
> > > > to dig into it).
> > > 
> > > From which snapshot was the kernel that hanged or panic'd?
> > > 
> > 
> > It was running this:
> > 
> > OpenBSD 6.2-current (GENERIC.MP) #199: Tue Nov  7 18:41:54 MST 2017
> > 
> > I've got it onto a remote control PDU now, now looking for some machine
> > with an old enough ssh client to be able to connect to the PDU :-|
> > 
> > Which kernel would be most useful to run now?
> 
> -current
> 
> > I have now moved it to -current GENERIC.MP with the "fast path" chunk
> > removed from amd64/amd64/fpu.c fpu_kernel_enter() which we still suspect
> > as maybe having some issues.
> 
> That's perfect from my point of view.
> 

Same after an hour or two uptime, but this time I get some "netlock:
lock not held" from some cpu or other, and some functions in the bits of
the trace that get displayed:

login: panic: kernel diagnostic assertion "_kernel_lock_held()" failed: file 
"/src/cvs-openbsd/sys/kern/uipc_socket2.c", line 310
Starting stack trace...
panic() at panic+0x11b
__assert(812105d4,80001f898a70,ff0063dc5b00,ff0061804318) 
at __assert+0x24
sbappendaddr(0,ff0061804318,ff005fca5600,0,ff0063dc5b00) at 
sbappendaddrpanic: netlock: lock not held
Faulted in traceback, aborting...
+0x276
pfkey_sendup(4,c,808f8b00) at pfkey_sendup+0x75
pfkeyv2_sendmessage(ff00617e9160,80902700,ff00617e00a0,1,809027d8,2)
 at pfkeyv2_sendmessage+0x228
pfkeyv2_acquire(ff00617e924c,ff0067772090,ff006777201c,ff00617e9160,80001f898dc8)
 at pfkeyv2_acquire+0x553
ipsp_acquire_sa(ff00617e9160,0,804d3880,80001f898f20,0) at 
panic: netlock: lock not heldipsp_acquire_sa
Faulted in traceback, aborting...
+0x4c6panic: netlock: lock not held
Faulted in traceback, aborting...

panic: netlock: lock not held
Faulted in traceback, aborting...
ipsp_spd_lookup(panic: ff0005747400,netlock: lock not held
Faulted in traceback, aborting...
0,panic: netlock: lock not held804dc900,80001f898fb0
Faulted in traceback, aborting...
,panic: netlock: lock not held
Faulted in traceback, aborting...
0,panic: netlock: lock not held
Faulted in traceback, aborting...
9c519d9d517a98c1) at panic: netlock: lock not held
Faulted in traceback, aborting...
ipsp_spd_lookuppanic: netlock: lock not held+0xcbe
Faulted in traceback, aborting...

panic: netlock: lock not held
Faulted in traceback, aborting...
ip_output_ipsec_lookup(panic: netlock: lock not held
Faulted in traceback, aborting...
80001f898fc0,panic: netlock: lock not held
Faulted in traceback, aborting...
ff006276f4d4,panic: netlock: lock not held804dc900
Faulted in traceback, aborting...
,panic: netlock: lock not held
Faulted in traceback, aborting...
80001f898fb0,panic: netlock: lock not held
Faulted in traceback, aborting...
0) at panic: netlock: lock not held
Faulted in traceback, aborting...
ip_output_ipsec_lookuppanic: netlock: lock not held+0x34
Faulted in traceback, aborting...

panic: netlock: lock not held
Faulted in traceback, aborting...
ip_output(panic: netlock: lock not held
Faulted in traceback, aborting...
0,panic: 0,netlock: lock not held
Faulted in traceback, aborting...
1,panic: netlock: lock not held
Faulted in traceback, aborting...
ff00615ed020panic: netlock: lock not held
Faulted in traceback, aborting...
,panic: ff0005747400,netlock: lock not held
Faulted in traceback, aborting...
9c519d9d517a98c1) at panic: ip_outputnetlock: lock not held
Faulted in traceback, aborting...
+0x3e7panic: netlock: lock not held
Faulted in traceback, aborting...

panic: netlock: lock not held
Faulted in traceback, aborting...
ip_forward(panic: netlock: lock not held
Faulted in traceback, 

Re: assertion "_kernel_lock_held()" failed, uipc_socket2.c: ipsec

2017-11-13 Thread Martin Pieuchot
On 13/11/17(Mon) 10:03, Stuart Henderson wrote:
> On 2017/11/13 08:44, Martin Pieuchot wrote:
> > On 12/11/17(Sun) 22:10, Stuart Henderson wrote:
> > > On 2017/11/12 22:48, Martin Pieuchot wrote:
> > > > On 12/11/17(Sun) 21:30, Stuart Henderson wrote:
> > > > > iked box, GENERIC.MP + WITNESS, -current as of Friday 10th:
> > > > 
> > > > Weird, did you tweak "kern.splassert" on this box?   Otherwise it looks
> > > > like a major corruption.
> > > 
> > > It would have kern.splassert=2. (I know this can cause problems
> > > sometimes, though this would be the first time in 5+ years I've bumped
> > > into it, most of my routers where I have serial console have this set).
> > 
> > Well the panic below corresponds to a value of 0 or > 3.
> 
> Confirmed, it was definitely set to 2.

So it seems that two of your CPUs end up looking at/dealing with
corrupted memory...

> > > I'm trying to get more information because it had either hanged or
> > > panicked previously (it didn't have serial connected at the time and
> > > the machine was needed so it had to be rebooted before I had a chance
> > > to dig into it).
> > 
> > From which snapshot was the kernel that hanged or panic'd?
> > 
> 
> It was running this:
> 
> OpenBSD 6.2-current (GENERIC.MP) #199: Tue Nov  7 18:41:54 MST 2017
> 
> I've got it onto a remote control PDU now, now looking for some machine
> with an old enough ssh client to be able to connect to the PDU :-|
> 
> Which kernel would be most useful to run now?

-current

> I have now moved it to -current GENERIC.MP with the "fast path" chunk
> removed from amd64/amd64/fpu.c fpu_kernel_enter() which we still suspect
> as maybe having some issues.

That's perfect from my point of view.



Re: assertion "_kernel_lock_held()" failed, uipc_socket2.c: ipsec

2017-11-13 Thread Stuart Henderson
On 2017/11/13 08:44, Martin Pieuchot wrote:
> On 12/11/17(Sun) 22:10, Stuart Henderson wrote:
> > On 2017/11/12 22:48, Martin Pieuchot wrote:
> > > On 12/11/17(Sun) 21:30, Stuart Henderson wrote:
> > > > iked box, GENERIC.MP + WITNESS, -current as of Friday 10th:
> > > 
> > > Weird, did you tweak "kern.splassert" on this box?   Otherwise it looks
> > > like a major corruption.
> > 
> > It would have kern.splassert=2. (I know this can cause problems
> > sometimes, though this would be the first time in 5+ years I've bumped
> > into it, most of my routers where I have serial console have this set).
> 
> Well the panic below corresponds to a value of 0 or > 3.

Confirmed, it was definitely set to 2.

> > > > login: panic: kernel diagnostic assertion "_kernel_lock_held()" failed: 
> > > > file "/src/cvs-openbsd/sys/kern/uipc_socket2.c", line 310
> > > ^^^
> > > Looks like one CPU is triggering this.
> > > 
> > > > splassert: soassertlocked: want 1 have 256
> > > > 
> > > > panic: spl assertion failure in soassertlocked
> > > ^^^
> > > That can't be coming from the same CPU..
> > > 
> > > 
> > > 
> > > 
> > > > Starting stack trace...
> > > > Faulted in traceback, aborting...
> > > > panic(splassert: if_down: want 1 have 256
> > > > panic: spl assertion failure in if_down) at
> > > > Faulted in traceback, aborting...
> > > > panicsplassert: if_down: want 1 have 256
> > > > +0x133panic: spl assertion failure in if_down
> > > > Faulted in traceback, aborting...
> > > > 
> > > > 
> > > > 
> > > > It's stuck at this point, I can't enter ddb.
> > > 
> > > Are you running with WITNESS on purpose?  Can you reproduce such a problem
> > > without it?  I'm not saying it's WITNESS's fault, but it's clear that
> > > WITNESS kernels aren't ready for production yet.
> > > 
> > 
> > I'm trying to get more information because it had either hanged or
> > panicked previously (it didn't have serial connected at the time and
> > the machine was needed so it had to be rebooted before I had a chance
> > to dig into it).
> 
> From which snapshot was the kernel that hanged or panic'd?
> 

It was running this:

OpenBSD 6.2-current (GENERIC.MP) #199: Tue Nov  7 18:41:54 MST 2017

I've got it onto a remote control PDU now, now looking for some machine
with an old enough ssh client to be able to connect to the PDU :-|

Which kernel would be most useful to run now?

I have now moved it to -current GENERIC.MP with the "fast path" chunk
removed from amd64/amd64/fpu.c fpu_kernel_enter() which we still suspect
as maybe having some issues.



Re: assertion "_kernel_lock_held()" failed, uipc_socket2.c: ipsec

2017-11-12 Thread Martin Pieuchot
On 12/11/17(Sun) 22:10, Stuart Henderson wrote:
> On 2017/11/12 22:48, Martin Pieuchot wrote:
> > On 12/11/17(Sun) 21:30, Stuart Henderson wrote:
> > > iked box, GENERIC.MP + WITNESS, -current as of Friday 10th:
> > 
> > Weird, did you tweak "kern.splassert" on this box?   Otherwise it looks
> > like a major corruption.
> 
> It would have kern.splassert=2. (I know this can cause problems
> sometimes, though this would be the first time in 5+ years I've bumped
> into it, most of my routers where I have serial console have this set).

Well the panic below corresponds to a value of 0 or > 3.

> > > login: panic: kernel diagnostic assertion "_kernel_lock_held()" failed: 
> > > file "/src/cvs-openbsd/sys/kern/uipc_socket2.c", line 310
> > ^^^
> > Looks like one CPU is triggering this.
> > 
> > > splassert: soassertlocked: want 1 have 256
> > > 
> > > panic: spl assertion failure in soassertlocked
> > ^^^
> > That can't be coming from the same CPU..
> > 
> > 
> > 
> > 
> > > Starting stack trace...
> > > Faulted in traceback, aborting...
> > > panic(splassert: if_down: want 1 have 256
> > > panic: spl assertion failure in if_down) at
> > > Faulted in traceback, aborting...
> > > panicsplassert: if_down: want 1 have 256
> > > +0x133panic: spl assertion failure in if_down
> > > Faulted in traceback, aborting...
> > > 
> > > 
> > > 
> > > It's stuck at this point, I can't enter ddb.
> > 
> > Are you running with WITNESS on purpose?  Can you reproduce such a problem
> > without it?  I'm not saying it's WITNESS's fault, but it's clear that
> > WITNESS kernels aren't ready for production yet.
> > 
> 
> I'm trying to get more information because it had either hanged or
> panicked previously (it didn't have serial connected at the time and
> the machine was needed so it had to be rebooted before I had a chance
> to dig into it).

From which snapshot was the kernel that hanged or panic'd?



Re: assertion "_kernel_lock_held()" failed, uipc_socket2.c: ipsec

2017-11-12 Thread Stuart Henderson
On 2017/11/12 22:48, Martin Pieuchot wrote:
> On 12/11/17(Sun) 21:30, Stuart Henderson wrote:
> > iked box, GENERIC.MP + WITNESS, -current as of Friday 10th:
> 
> Weird, did you tweak "kern.splassert" on this box?   Otherwise it looks
> like a major corruption.

It would have kern.splassert=2. (I know this can cause problems
sometimes, though this would be the first time in 5+ years I've bumped
into it, most of my routers where I have serial console have this set).

> > login: panic: kernel diagnostic assertion "_kernel_lock_held()" failed: 
> > file "/src/cvs-openbsd/sys/kern/uipc_socket2.c", line 310
> ^^^
> Looks like one CPU is triggering this.
> 
> > splassert: soassertlocked: want 1 have 256
> > 
> > panic: spl assertion failure in soassertlocked
> ^^^
> That can't be coming from the same CPU..
> 
> 
> 
> 
> > Starting stack trace...
> > Faulted in traceback, aborting...
> > panic(splassert: if_down: want 1 have 256
> > panic: spl assertion failure in if_down) at
> > Faulted in traceback, aborting...
> > panicsplassert: if_down: want 1 have 256
> > +0x133panic: spl assertion failure in if_down
> > Faulted in traceback, aborting...
> > 
> > 
> > 
> > It's stuck at this point, I can't enter ddb.
> 
> Are you running with WITNESS on purpose?  Can you reproduce such a problem
> without it?  I'm not saying it's WITNESS's fault, but it's clear that
> WITNESS kernels aren't ready for production yet.
> 

I'm trying to get more information because it had either hanged or
panicked previously (it didn't have serial connected at the time and
the machine was needed so it had to be rebooted before I had a chance
to dig into it).



Re: assertion "_kernel_lock_held()" failed, uipc_socket2.c: ipsec

2017-11-12 Thread Martin Pieuchot
On 12/11/17(Sun) 21:30, Stuart Henderson wrote:
> iked box, GENERIC.MP + WITNESS, -current as of Friday 10th:

Weird, did you tweak "kern.splassert" on this box?   Otherwise it looks
like a major corruption.

> login: panic: kernel diagnostic assertion "_kernel_lock_held()" failed: file 
> "/src/cvs-openbsd/sys/kern/uipc_socket2.c", line 310
^^^
Looks like one CPU is triggering this.

> splassert: soassertlocked: want 1 have 256
> 
> panic: spl assertion failure in soassertlocked
^^^
That can't be coming from the same CPU..




> Starting stack trace...
> Faulted in traceback, aborting...
> panic(splassert: if_down: want 1 have 256
> panic: spl assertion failure in if_down) at
> Faulted in traceback, aborting...
> panicsplassert: if_down: want 1 have 256
> +0x133panic: spl assertion failure in if_down
> Faulted in traceback, aborting...
> 
> 
> 
> It's stuck at this point, I can't enter ddb.

Are you running with WITNESS on purpose?  Can you reproduce such a problem
without it?  I'm not saying it's WITNESS's fault, but it's clear that
WITNESS kernels aren't ready for production yet.