Re: [knot-dns-users] Bug report: knotd crashes due to PCKS #11 race condition

2018-09-17 Thread Daniel Salzman
Hi Rick,

I have reproduced the issue using a simulator of your HSM.

Daniel

On 08/27/2018 04:45 PM, Rick van Rein wrote:
> Hi,
> 
> Roland and I ran into a crashing condition for knotd 2.6.[689],
> presumably caused by a race condition in the threaded use of PKCS #11
> sessions.  We use a commercial, replicated, networked HSM and not SoftHSM2.
> 
> WORKAROUND:
> 
> We do have a work-around with "conf-set server.background-workers 1" so
> this is not a blocking condition for us, but handling our ~1700 zones
> concurrency would be add back later.
> 
> PROBLEM DESCRIPTION:
> 
> Without this work-around, we see crashes quite reliably, on a load that
> does a number of zone-set/-unset commands, fired by sequentialised knotc
> processes to a knotd that continues to fire zone signing concurrently.
> The commands are generated with the knot-aware option -k from ldns-zonediff,
> https://github.com/SURFnet/ldns-zonediff
> 
> ANALYSIS:
> 
> Our HSM reports errors that look like a session handle is reused and
> then repeatedly logged into, but not always, so it looks like a race
> condition on a session variable,
> 
> 27.08.2018 11:48:59 | [6AE9:6AEE] C_Login
>| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
> 27.08.2018 11:48:59 | [6AE9:6AEE] C_GetAttributeValue
>| E: Error CKR_USER_NOT_LOGGED_IN occurred.
> 27.08.2018 11:48:59 | [6AE9:6AED] C_Login
>| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
> 27.08.2018 11:48:59 | [6AE9:6AED] C_GetAttributeValue
>| E: Error CKR_USER_NOT_LOGGED_IN occurred.
> 27.08.2018 11:49:01 | [6AE9:6AED] C_Login
>| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
> 27.08.2018 11:49:01 | [6AE9:6AED] C_Login
>| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
> 27.08.2018 11:49:01 | [6AE9:6AED] C_GetAttributeValue
>| E: Error CKR_USER_NOT_LOGGED_IN occurred.
> 27.08.2018 11:49:02 | [6AE9:6AEE] C_Login
>| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
> 27.08.2018 11:49:03 | [6AE9:6AEE] C_Login
>| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
> 27.08.2018 11:55:50 | [744C:744E] C_Login
>| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
> 
> These errors stopped being reported with the work-around configured.
> Until that time, we have crashes, of which the following dumps one:
> 
> Thread 4 "knotd" received signal SIGABRT, Aborted.
> [Switching to Thread 0x7fffcd1bd700 (LWP 27375)]
> 0x76967428 in __GI_raise (sig=sig@entry=6) at
> ../sysdeps/unix/sysv/linux/raise.c:54
> 54  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0  0x76967428 in __GI_raise (sig=sig@entry=6) at
> ../sysdeps/unix/sysv/linux/raise.c:54
> #1  0x7696902a in __GI_abort () at abort.c:89
> #2  0x769a97ea in __libc_message (do_abort=do_abort@entry=2,
> fmt=fmt@entry=0x76ac2ed8 "*** Error in `%s': %s: 0x%s ***\n") at
> ../sysdeps/posix/libc_fatal.c:175
> #3  0x769b237a in malloc_printerr (ar_ptr=,
> ptr=,
> str=0x76ac2fe8 "double free or corruption (out)", action=3) at
> malloc.c:5006
> #4  _int_free (av=, p=, have_lock=0) at
> malloc.c:3867
> #5  0x769b653c in __GI___libc_free (mem=) at
> malloc.c:2968
> #6  0x55597ed3 in ?? ()
> #7  0x555987c2 in ?? ()
> #8  0x5559ba01 in ?? ()
> #9  0x77120338 in ?? () from /usr/lib/x86_64-linux-gnu/liburcu.so.4
> #10 0x76d036ba in start_thread (arg=0x7fffcd1bd700) at
> pthread_create.c:333
> #11 0x76a3941d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> 
> DEBUGGING HINTS:
> 
> Our suspicion is that you may not have set the mutex callbacks when
> invoking C_Initialize() on PKCS #11, possibly due to the intermediate
> layers of abstraction hiding this from view.  This happens more often.
> 
> Then again, the double free might pose another hint.
> 
> This is on our soon-to-go-live platform, so I'm afraid it'll be very
> difficult to do much more testing, I hope this suffices for your debugging!
> 
> 
> I hope this helps Knot DNS to move forward!
>  -Rick
> 
-- 
https://lists.nic.cz/cgi-bin/mailman/listinfo/knot-dns-users


Re: [knot-dns-users] Bug report: knotd crashes due to PCKS #11 race condition

2018-08-27 Thread daniel . salzman

Hi Rick,

It's not clear to me what is going on there :-/

Are you able to reproduce this issue with SoftHSM2? Or would it be 
possible

to give me access to the HSM? Otherwise, I have no idea how to debug the
crash.

Daniel

On 2018-08-27 16:45, Rick van Rein wrote:

Hi,

Roland and I ran into a crashing condition for knotd 2.6.[689],
presumably caused by a race condition in the threaded use of PKCS #11
sessions.  We use a commercial, replicated, networked HSM and not 
SoftHSM2.


WORKAROUND:

We do have a work-around with "conf-set server.background-workers 1" so
this is not a blocking condition for us, but handling our ~1700 zones
concurrency would be add back later.

PROBLEM DESCRIPTION:

Without this work-around, we see crashes quite reliably, on a load that
does a number of zone-set/-unset commands, fired by sequentialised 
knotc

processes to a knotd that continues to fire zone signing concurrently.
The commands are generated with the knot-aware option -k from 
ldns-zonediff,

https://github.com/SURFnet/ldns-zonediff

ANALYSIS:

Our HSM reports errors that look like a session handle is reused and
then repeatedly logged into, but not always, so it looks like a race
condition on a session variable,

27.08.2018 11:48:59 | [6AE9:6AEE] C_Login
   | E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:48:59 | [6AE9:6AEE] C_GetAttributeValue
   | E: Error CKR_USER_NOT_LOGGED_IN occurred.
27.08.2018 11:48:59 | [6AE9:6AED] C_Login
   | E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:48:59 | [6AE9:6AED] C_GetAttributeValue
   | E: Error CKR_USER_NOT_LOGGED_IN occurred.
27.08.2018 11:49:01 | [6AE9:6AED] C_Login
   | E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:49:01 | [6AE9:6AED] C_Login
   | E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:49:01 | [6AE9:6AED] C_GetAttributeValue
   | E: Error CKR_USER_NOT_LOGGED_IN occurred.
27.08.2018 11:49:02 | [6AE9:6AEE] C_Login
   | E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:49:03 | [6AE9:6AEE] C_Login
   | E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:55:50 | [744C:744E] C_Login
   | E: Error CKR_USER_ALREADY_LOGGED_IN occurred.

These errors stopped being reported with the work-around configured.
Until that time, we have crashes, of which the following dumps one:

Thread 4 "knotd" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffcd1bd700 (LWP 27375)]
0x76967428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
54  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x76967428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
#1  0x7696902a in __GI_abort () at abort.c:89
#2  0x769a97ea in __libc_message (do_abort=do_abort@entry=2,
fmt=fmt@entry=0x76ac2ed8 "*** Error in `%s': %s: 0x%s ***\n") 
at

../sysdeps/posix/libc_fatal.c:175
#3  0x769b237a in malloc_printerr (ar_ptr=,
ptr=,
str=0x76ac2fe8 "double free or corruption (out)", action=3) at
malloc.c:5006
#4  _int_free (av=, p=, have_lock=0) at
malloc.c:3867
#5  0x769b653c in __GI___libc_free (mem=) at
malloc.c:2968
#6  0x55597ed3 in ?? ()
#7  0x555987c2 in ?? ()
#8  0x5559ba01 in ?? ()
#9  0x77120338 in ?? () from 
/usr/lib/x86_64-linux-gnu/liburcu.so.4

#10 0x76d036ba in start_thread (arg=0x7fffcd1bd700) at
pthread_create.c:333
#11 0x76a3941d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

DEBUGGING HINTS:

Our suspicion is that you may not have set the mutex callbacks when
invoking C_Initialize() on PKCS #11, possibly due to the intermediate
layers of abstraction hiding this from view.  This happens more often.

Then again, the double free might pose another hint.

This is on our soon-to-go-live platform, so I'm afraid it'll be very
difficult to do much more testing, I hope this suffices for your 
debugging!



I hope this helps Knot DNS to move forward!
 -Rick

--
https://lists.nic.cz/cgi-bin/mailman/listinfo/knot-dns-users