Public bug reported:

Binary package hint: sysklogd

This is similar to bug #218320, but occurs in a different context. The
failure mode here is a bit complicated, so bear with me as I describe
it.

Ubuntu Jaunty system. sysklogd and klogd packages at version
1.5-5ubuntu3. passwd and group lookups handled by LDAP.

In certain circumstances (which I will explain later), the system hangs
at "Starting kernel log daemon..." on bootup. Logins are not even
possible at this point---you enter "root", the password, and it just
hangs there. Networking is not up.

I modified one of the /etc/event.d/tty* scripts to start up a root shell
directly on the console. Examination of the process table at the point
of the hang showed that the start-stop-daemon(8) command that actually
invokes /sbin/klogd, from /etc/rc2.d/S11klogd (a.k.a. /etc/init.d/klogd)
had become stuck.

I modified the init.d script to run the problematic start-stop-daemon(8)
invocation through strace(1). Rebooted, another hang. The last part of
the generated strace log:

----8<----
socket(PF_FILE, 0x80002 /* SOCK_??? */, 0) = 3
connect(3, {sa_family=AF_FILE, path="/dev/log"...}, 110) = 0
send(3, "<83>Jun  1 16:16:37 start-stop-daemon: nss_ldap: could not connect to 
any LDAP server as cn=nssldap,ou=Accounts,dc=example,dc=com - Can't contact 
LDAP server"..., 158, MSG_NOSIGNAL) = 158
time(NULL)                              = 1243887397
send(3, "<86>Jun  1 16:16:37 start-stop-daemon: nss_ldap: failed to bind to 
LDAP server ldaps://172.25.85.16/: Can't contact LDAP server"..., 127, 
MSG_NOSIGNAL) = 127
time(NULL)                              = 1243887397
send(3, "<86>Jun  1 16:16:37 start-stop-daemon: nss_ldap: reconnecting to LDAP 
server..."..., 79, MSG_NOSIGNAL
---->8----

The system cannot connect to the LDAP server because the network is not
up. But networking should be up already at this point, yes? Well, I also
see this in the process table (via "pstree -ap"):

----8<----
init,1
[...]
  |-ifup,1120 --allow auto eth0
  |   `-sh,2107 -c ...
  |       `-dhclient3,2110 -e IF_METRIC=100 -pf /var/run/dhclient.eth0.pid -lf 
/var/lib/dhcp3/dhclient.eth0.leases eth0
---->8----

What is dhclient3 doing? Attaching with strace(1) shows

----8<----
Process 2110 attached - interrupt to quit
send(6, "<30>Jun  1 16:16:39 dhclient: DHCPDISCOVER on eth0 to 255.255.255.255 
port 67 interval 7"..., 88, MSG_NOSIGNAL <unfinished ...>
---->8----

Looks like it's trying to talk to /dev/log, just as start-stop-daemon(8)
was doing. Now, /dev/log is a socket handled by syslogd(8). So it seems
syslogd(8) might be slacking off, causing the holdups in these two
programs. If we attach to syslogd(8) with strace(1), what do we see?

----8<----
Process 2272 attached - interrupt to quit
send(17, "<83>Jun  1 16:57:32 syslogd: nss_ldap: could not connect to any LDAP 
server as cn=nssldap,ou=Accounts,dc=example,dc=com - Can't contact LDAP 
server"..., 148, MSG_NOSIGNAL <unfinished ...>
---->8----

Uh oh. Is syslogd(8) itself similarly stuck, trying to talk to... er...
itself?

I didn't really dig further, but you get the idea of some sort of
reentrancy-deadlock with LDAP lookups that horks up syslogd(8), and in
turn, everything else. There are two interesting points of note to add:

1. If you send SIGTERM to syslogd(8), killing it, the logjam clears up
completely. Networking comes up, the boot process finishes, and you can
log in. No syslog, but hey, working system.

2. The reason why this normally doesn't happen is because of nssldap-
update-ignoreusers(8). This command is automatically invoked as part of
the shutdown process; it appends a line to /etc/ldap.conf specifying
users (or groups?) to be ignored by LDAP, presumably avoiding the
superfluous lookups that produce the type of errors seen in the above
excerpts. My system encountered this because it was "ghosted" from
another system that had not undergone a reboot after installing LDAP, so
the particular line was missing from /etc/ldap.conf, and so the
problematic lookups were occurring. (If I reboot the system normally
after SIGTERM'ing syslogd(8) as above, the problem goes away.)

I think it's not enough to say that just running nssldap-update-
ignoreusers(8) fixes the problem and thus there is no bug; this is
clearly a race condition that syslogd(8) right now does not handle
gracefully, and ought to do so. It's entirely possible that other
manifestations of the hanging behavior, like bug #218320, have the same
underlying cause. I believe the ignoreusers bit should ultimately be
there just to avoid spurious LDAP errors, not because its absence breaks
syslog and system bootup.

** Affects: sysklogd (Ubuntu)
     Importance: Undecided
         Status: New

-- 
Hangs at "Starting kernel log daemon..." with LDAP
https://bugs.launchpad.net/bugs/382615
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.

-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

Reply via email to