Re: [knot-dns-users] Bug report: knotd crashes due to PCKS #11 race condition
Hi Rick, I have reproduced the issue using a simulator of your HSM. Daniel On 08/27/2018 04:45 PM, Rick van Rein wrote: > Hi, > > Roland and I ran into a crashing condition for knotd 2.6.[689], > presumably caused by a race condition in the threaded use of PKCS #11 > sessions. We use a commercial, replicated, networked HSM and not SoftHSM2. > > WORKAROUND: > > We do have a work-around with "conf-set server.background-workers 1" so > this is not a blocking condition for us, but handling our ~1700 zones > concurrency would be add back later. > > PROBLEM DESCRIPTION: > > Without this work-around, we see crashes quite reliably, on a load that > does a number of zone-set/-unset commands, fired by sequentialised knotc > processes to a knotd that continues to fire zone signing concurrently. > The commands are generated with the knot-aware option -k from ldns-zonediff, > https://github.com/SURFnet/ldns-zonediff > > ANALYSIS: > > Our HSM reports errors that look like a session handle is reused and > then repeatedly logged into, but not always, so it looks like a race > condition on a session variable, > > 27.08.2018 11:48:59 | [6AE9:6AEE] C_Login >| E: Error CKR_USER_ALREADY_LOGGED_IN occurred. > 27.08.2018 11:48:59 | [6AE9:6AEE] C_GetAttributeValue >| E: Error CKR_USER_NOT_LOGGED_IN occurred. > 27.08.2018 11:48:59 | [6AE9:6AED] C_Login >| E: Error CKR_USER_ALREADY_LOGGED_IN occurred. > 27.08.2018 11:48:59 | [6AE9:6AED] C_GetAttributeValue >| E: Error CKR_USER_NOT_LOGGED_IN occurred. > 27.08.2018 11:49:01 | [6AE9:6AED] C_Login >| E: Error CKR_USER_ALREADY_LOGGED_IN occurred. > 27.08.2018 11:49:01 | [6AE9:6AED] C_Login >| E: Error CKR_USER_ALREADY_LOGGED_IN occurred. > 27.08.2018 11:49:01 | [6AE9:6AED] C_GetAttributeValue >| E: Error CKR_USER_NOT_LOGGED_IN occurred. > 27.08.2018 11:49:02 | [6AE9:6AEE] C_Login >| E: Error CKR_USER_ALREADY_LOGGED_IN occurred. > 27.08.2018 11:49:03 | [6AE9:6AEE] C_Login >| E: Error CKR_USER_ALREADY_LOGGED_IN occurred. > 27.08.2018 11:55:50 | [744C:744E] C_Login >| E: Error CKR_USER_ALREADY_LOGGED_IN occurred. > > These errors stopped being reported with the work-around configured. > Until that time, we have crashes, of which the following dumps one: > > Thread 4 "knotd" received signal SIGABRT, Aborted. > [Switching to Thread 0x7fffcd1bd700 (LWP 27375)] > 0x76967428 in __GI_raise (sig=sig@entry=6) at > ../sysdeps/unix/sysv/linux/raise.c:54 > 54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. > (gdb) bt > #0 0x76967428 in __GI_raise (sig=sig@entry=6) at > ../sysdeps/unix/sysv/linux/raise.c:54 > #1 0x7696902a in __GI_abort () at abort.c:89 > #2 0x769a97ea in __libc_message (do_abort=do_abort@entry=2, > fmt=fmt@entry=0x76ac2ed8 "*** Error in `%s': %s: 0x%s ***\n") at > ../sysdeps/posix/libc_fatal.c:175 > #3 0x769b237a in malloc_printerr (ar_ptr=, > ptr=, > str=0x76ac2fe8 "double free or corruption (out)", action=3) at > malloc.c:5006 > #4 _int_free (av=, p=, have_lock=0) at > malloc.c:3867 > #5 0x769b653c in __GI___libc_free (mem=) at > malloc.c:2968 > #6 0x55597ed3 in ?? () > #7 0x555987c2 in ?? () > #8 0x5559ba01 in ?? () > #9 0x77120338 in ?? () from /usr/lib/x86_64-linux-gnu/liburcu.so.4 > #10 0x76d036ba in start_thread (arg=0x7fffcd1bd700) at > pthread_create.c:333 > #11 0x76a3941d in clone () at > ../sysdeps/unix/sysv/linux/x86_64/clone.S:109 > > DEBUGGING HINTS: > > Our suspicion is that you may not have set the mutex callbacks when > invoking C_Initialize() on PKCS #11, possibly due to the intermediate > layers of abstraction hiding this from view. This happens more often. > > Then again, the double free might pose another hint. > > This is on our soon-to-go-live platform, so I'm afraid it'll be very > difficult to do much more testing, I hope this suffices for your debugging! > > > I hope this helps Knot DNS to move forward! > -Rick > -- https://lists.nic.cz/cgi-bin/mailman/listinfo/knot-dns-users
Re: [knot-dns-users] Bug report: knotd crashes due to PCKS #11 race condition
Hi Rick, It's not clear to me what is going on there :-/ Are you able to reproduce this issue with SoftHSM2? Or would it be possible to give me access to the HSM? Otherwise, I have no idea how to debug the crash. Daniel On 2018-08-27 16:45, Rick van Rein wrote: Hi, Roland and I ran into a crashing condition for knotd 2.6.[689], presumably caused by a race condition in the threaded use of PKCS #11 sessions. We use a commercial, replicated, networked HSM and not SoftHSM2. WORKAROUND: We do have a work-around with "conf-set server.background-workers 1" so this is not a blocking condition for us, but handling our ~1700 zones concurrency would be add back later. PROBLEM DESCRIPTION: Without this work-around, we see crashes quite reliably, on a load that does a number of zone-set/-unset commands, fired by sequentialised knotc processes to a knotd that continues to fire zone signing concurrently. The commands are generated with the knot-aware option -k from ldns-zonediff, https://github.com/SURFnet/ldns-zonediff ANALYSIS: Our HSM reports errors that look like a session handle is reused and then repeatedly logged into, but not always, so it looks like a race condition on a session variable, 27.08.2018 11:48:59 | [6AE9:6AEE] C_Login | E: Error CKR_USER_ALREADY_LOGGED_IN occurred. 27.08.2018 11:48:59 | [6AE9:6AEE] C_GetAttributeValue | E: Error CKR_USER_NOT_LOGGED_IN occurred. 27.08.2018 11:48:59 | [6AE9:6AED] C_Login | E: Error CKR_USER_ALREADY_LOGGED_IN occurred. 27.08.2018 11:48:59 | [6AE9:6AED] C_GetAttributeValue | E: Error CKR_USER_NOT_LOGGED_IN occurred. 27.08.2018 11:49:01 | [6AE9:6AED] C_Login | E: Error CKR_USER_ALREADY_LOGGED_IN occurred. 27.08.2018 11:49:01 | [6AE9:6AED] C_Login | E: Error CKR_USER_ALREADY_LOGGED_IN occurred. 27.08.2018 11:49:01 | [6AE9:6AED] C_GetAttributeValue | E: Error CKR_USER_NOT_LOGGED_IN occurred. 27.08.2018 11:49:02 | [6AE9:6AEE] C_Login | E: Error CKR_USER_ALREADY_LOGGED_IN occurred. 27.08.2018 11:49:03 | [6AE9:6AEE] C_Login | E: Error CKR_USER_ALREADY_LOGGED_IN occurred. 27.08.2018 11:55:50 | [744C:744E] C_Login | E: Error CKR_USER_ALREADY_LOGGED_IN occurred. These errors stopped being reported with the work-around configured. Until that time, we have crashes, of which the following dumps one: Thread 4 "knotd" received signal SIGABRT, Aborted. [Switching to Thread 0x7fffcd1bd700 (LWP 27375)] 0x76967428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (gdb) bt #0 0x76967428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54 #1 0x7696902a in __GI_abort () at abort.c:89 #2 0x769a97ea in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x76ac2ed8 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175 #3 0x769b237a in malloc_printerr (ar_ptr=, ptr=, str=0x76ac2fe8 "double free or corruption (out)", action=3) at malloc.c:5006 #4 _int_free (av=, p=, have_lock=0) at malloc.c:3867 #5 0x769b653c in __GI___libc_free (mem=) at malloc.c:2968 #6 0x55597ed3 in ?? () #7 0x555987c2 in ?? () #8 0x5559ba01 in ?? () #9 0x77120338 in ?? () from /usr/lib/x86_64-linux-gnu/liburcu.so.4 #10 0x76d036ba in start_thread (arg=0x7fffcd1bd700) at pthread_create.c:333 #11 0x76a3941d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109 DEBUGGING HINTS: Our suspicion is that you may not have set the mutex callbacks when invoking C_Initialize() on PKCS #11, possibly due to the intermediate layers of abstraction hiding this from view. This happens more often. Then again, the double free might pose another hint. This is on our soon-to-go-live platform, so I'm afraid it'll be very difficult to do much more testing, I hope this suffices for your debugging! I hope this helps Knot DNS to move forward! -Rick -- https://lists.nic.cz/cgi-bin/mailman/listinfo/knot-dns-users