Bug#946847: sssd_be: Busy loops on flaky LDAP, SIGTERM from watchdog not processed

Dominik George Mon, 16 Dec 2019 06:24:48 -0800

Package: sssd
Version: 2.2.2-1+b1
Severity: important
Tags: upstream

In a setup where sssd uses a remote slapd for NSS over a somewhat flaky
network, sssd_be sometimes gets into a busy loop, using 100% CPU time on
one core.


Debugging showed that sssd has a watchdog to clean up in such cases, but
sssd_be installs a signal handler that prevents the SIGTERM sent to the
process group from being processed correctly, so the process does not exit.

src/util/util_watchdog.c:

     64 /* the watchdog is purposefully *not* handled by the tevent
     65  * signal handler as it is meant to check if the daemon is
     66  * still processing the event queue itself. A stuck process
     67  * may not handle the event queue at all and thus not handle
     68  * signals either */
     69 static void watchdog_handler(int sig)
     70 {
     71 
     72     watchdog_detect_timeshift();
     73 
     74     /* if a pre-defined number of ticks passed by kills itself */
     75     if (__sync_add_and_fetch(&watchdog_ctx.ticks, 1) > WATCHDOG_MAX_TICKS) {
     76         if (getpid() == getpgrp()) {
     77             kill(-getpgrp(), SIGTERM);
     78         } else {
     79             _exit(1);
     80         }
     81     }
     82 }

(NB: It seems that what the comment describes was not all too successful ;)

The signal handler is installed in src/providers/data_provider_be.c:

    static void be_process_finalize(struct tevent_context *ev,
                                    struct tevent_signal *se,
                                    int signum,
                                    int count,
                                    void *siginfo,
                                    void *private_data)
    {
        struct be_ctx *be_ctx;

        be_ctx = talloc_get_type(private_data, struct be_ctx);
        talloc_free(be_ctx);
        orderly_shutdown(0);
    }

    static errno_t be_process_install_sigterm_handler(struct be_ctx *be_ctx)
    {
        struct tevent_signal *sige;

        BlockSignals(false, SIGTERM);

        sige = tevent_add_signal(be_ctx->ev, be_ctx, SIGTERM, SA_SIGINFO,
                                 be_process_finalize, be_ctx);
        if (sige == NULL) {
            DEBUG(SSSDBG_CRIT_FAILURE, "tevent_add_signal failed.\n");
            return ENOMEM;
        }

        return EOK;
    }

Setting a breakpoint on be_process_finalize showed that this function is
never reached, probably because libtevent never gets around to calling it.
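
To illustrate the suspected mechanism, here is a minimal standalone sketch.
It does not use libtevent; record_sigterm() is a hypothetical stand-in for
any handler that only records the signal and leaves the real work to an
event loop. All names in it are made up for this example. A child that
spins without ever returning to such a loop should survive SIGTERM:

```c
#define _XOPEN_SOURCE 700   /* expose POSIX declarations under strict -std modes */
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t term_seen;

/* Stand-in for a deferred handler: record the signal, act on it later. */
static void record_sigterm(int sig)
{
    (void)sig;
    term_seen = 1;
}

/* Returns 1 if a spinning child is still alive 200 ms after SIGTERM,
 * 0 if it has exited, -1 on setup errors. */
int stuck_child_survives_sigterm(void)
{
    struct timespec grace = { 0, 200 * 1000 * 1000 };
    int ready[2];
    int survived;
    char c;
    pid_t pid;

    if (pipe(ready) < 0) {
        return -1;
    }
    pid = fork();
    if (pid < 0) {
        close(ready[0]);
        close(ready[1]);
        return -1;
    }

    if (pid == 0) {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = record_sigterm;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGTERM, &sa, NULL);
        close(ready[0]);
        /* Tell the parent that the handler is installed. */
        if (write(ready[1], "x", 1) != 1) {
            _exit(2);
        }
        /* Stuck: the flag is set by the handler but never acted on. */
        for (;;) {
            (void)term_seen;
        }
    }

    close(ready[1]);
    if (read(ready[0], &c, 1) != 1) {
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        close(ready[0]);
        return -1;
    }
    close(ready[0]);

    kill(pid, SIGTERM);
    nanosleep(&grace, NULL);
    survived = (waitpid(pid, NULL, WNOHANG) == 0);
    if (survived) {
        /* Clean up the child that ignored SIGTERM in practice. */
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
    }
    return survived;
}
```

Calling stuck_child_survives_sigterm() from a small main() is expected to
return 1, which matches what we observed with sssd_be.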

Two proposals to circumvent this are:

 a) Reset the handler before calling kill on the process group in line 77
    (e.g. signal(SIGTERM, SIG_DFL);)
 b) Move the exit call in line 79 out of the branch so it gets called
    unconditionally in case kill() fails to kill the process itself

We tested solution a) in gdb and it caused sssd_be to exit cleanly and
restart, as it should.
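
For reference, a rough standalone sketch of both proposals combined. This
is not a patch against sssd: DEMO_MAX_TICKS, demo_watchdog_handler(),
deferred_sigterm() and run_stuck_child() are names made up for this
example, and a setitimer()-driven SIGALRM is only assumed here as a
stand-in for the real watchdog timer.

```c
#define _XOPEN_SOURCE 700   /* expose POSIX/XSI declarations under strict -std modes */
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define DEMO_MAX_TICKS 3    /* hypothetical; not sssd's WATCHDOG_MAX_TICKS */

static volatile sig_atomic_t ticks;
static volatile sig_atomic_t term_pending;

/* Stand-in for a handler that defers the real work to an event loop. */
static void deferred_sigterm(int sig)
{
    (void)sig;
    term_pending = 1;
}

/* Watchdog tick with proposals a) and b) applied. */
static void demo_watchdog_handler(int sig)
{
    (void)sig;
    if (++ticks > DEMO_MAX_TICKS) {
        if (getpid() == getpgrp()) {
            signal(SIGTERM, SIG_DFL);   /* a) drop the deferred handler */
            kill(-getpgrp(), SIGTERM);
        }
        _exit(1);                       /* b) fallback if still alive */
    }
}

static void install_handler(int signum, void (*fn)(int))
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = fn;
    sigemptyset(&sa.sa_mask);
    sigaction(signum, &sa, NULL);
}

/* Forks a child that spins like the stuck backend and reports how it
 * ended: the terminating signal number, -1 for a normal exit, -2 on
 * errors in the parent. */
int run_stuck_child(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0) {
        return -2;
    }
    if (pid == 0) {
        /* 10 ms ticks so the demo finishes quickly. */
        struct itimerval tv = { { 0, 10000 }, { 0, 10000 } };

        setpgid(0, 0);                  /* become process group leader */
        install_handler(SIGTERM, deferred_sigterm);
        install_handler(SIGALRM, demo_watchdog_handler);
        setitimer(ITIMER_REAL, &tv, NULL);
        /* Busy loop that never gets back to any event queue. */
        for (;;) {
            (void)term_pending;
        }
    }
    if (waitpid(pid, &status, 0) < 0) {
        return -2;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

With the handler reset in place, run_stuck_child() should report that the
child was terminated by SIGTERM. Both signal() and _exit() are on the POSIX
list of async-signal-safe functions, so calling them from the watchdog
handler should be fine.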

Cheers,
Nik

Analysis was sponsored by Teckids e.V. and tarent solutions GmbH.

-- System Information:
Debian Release: bullseye/sid
  APT prefers testing-debug
  APT policy: (500, 'testing-debug'), (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 5.3.0-2-amd64 (SMP w/4 CPU cores)
Locale: LANG=de_DE.UTF-8, LC_CTYPE=de_DE.UTF-8 (charmap=UTF-8), LANGUAGE=de_DE.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages sssd depends on:
ii  python3-sss  2.2.2-1+b1
ii  sssd-ad      2.2.2-1+b1
ii  sssd-common  2.2.2-1+b1
ii  sssd-ipa     2.2.2-1+b1
ii  sssd-krb5    2.2.2-1+b1
ii  sssd-ldap    2.2.2-1+b1
ii  sssd-proxy   2.2.2-1+b1

sssd recommends no packages.

sssd suggests no packages.

-- no debconf information
