On Tue, Mar 9, 2010 at 10:03 AM, Florian Forster <o...@verplant.org> wrote: > Hi Ryan, > > On Tue, Mar 02, 2010 at 08:27:27PM -0800, Ryan Tomayko wrote: >> collectd 4.8.1, http://collectd.org/ > > no applicable problem has been fixed in the Exec plugin in the meantime, > so the problem should still exist in the master branch. > >> Once I notice the plugin has stopped reporting, I have an extra >> process (28489) hanging around: >> >> $ pstree -apu 22935 >> collectdmon,22935 -P /var/run/collectdmon.pid -- -C >> /etc/collectd/collectd.conf >> collectd,22936 -C /etc/collectd/collectd.conf -f >> collectd,28489 -C /etc/collectd/collectd.conf -f >> {collectd},22937 >> {collectd},22938 >> {collectd},22939 >> {collectd},22940 >> {collectd},22941 >> {collectd},28487 > >> That process seems to exist only when the exec plugin is no longer >> reporting. Sometimes there's two of these processes. > > This looks like the code that is supposed to spawn a new instance of the > script failed after fork(2) but before exec(2). > > There are various cases in which the exec(2) is not reached in > "exec_child()", but they all emit an error message. I take it there is > no error message somewhere in the logs or in syslog?
I'm embarassed to admit I have not had the logs turned on. There's very likely an error message I'm not seeing. I'll switch them on today and see what happens. >> strace reports that the extra process is sitting in a mutex. It never >> leaves this state: >> >> $ sudo strace -p 28489 >> Process 28489 attached - interrupt to quit >> futex(0x7f2f7d4e8fb0, FUTEX_WAIT_PRIVATE, 2, NULL > > There is a mutex in the exec plugin, but I doubt that this is the > problem. It is held just before a thread is spawned (to set a flag) and > just before that thread exits (to reset the flag). I don't see any way > this could lead to a deadlock or starvation. > > I'm much more concerned about the SIGCHLD handler and the various > waitpid(2)s in the code. I could see the controlling thread missing its > child's signal and waiting forever. This shouldn't create weird new > processes though. > >> Any ideas what might be going on here or information I could provide >> to help find a root cause? > > I'm a bit puzzled by the described behavior, I have to admit. Maybe you > could provide the "lsof -p $PID" output for one of those weird child > processes? Here's the lsof output: # lsof -p 25335 COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME collectd 25335 root cwd DIR 254,0 4096 650195 /var/lib/collectd collectd 25335 root rtd DIR 254,0 4096 2 / collectd 25335 root txt REG 254,0 155960 114923 /usr/sbin/collectd collectd 25335 root mem REG 254,0 93016 262221 /lib/libgcc_s.so.1 collectd 25335 root mem REG 254,0 72568 262188 /lib/libresolv-2.7.so collectd 25335 root mem REG 254,0 18752 262195 /lib/libnss_dns-2.7.so collectd 25335 root mem REG 254,0 47520 262191 /lib/libnss_files-2.7.so collectd 25335 root mem REG 254,0 421512 122929 /usr/lib/libgcrypt.so.11.4.4 collectd 25335 root mem REG 254,0 40560 139512 /usr/lib/collectd/network.so collectd 25335 root mem REG 254,0 6624 139559 /usr/lib/collectd/memory.so collectd 25335 root mem REG 254,0 12424 139514 /usr/lib/collectd/memcached.so collectd 25335 root mem REG 254,0 5288 139551 /usr/lib/collectd/load.so collectd 25335 root mem REG 254,0 7208 139530 /usr/lib/collectd/interface.so collectd 25335 root mem REG 254,0 22128 139513 /usr/lib/collectd/exec.so collectd 25335 root mem REG 254,0 13344 139548 /usr/lib/collectd/df.so collectd 25335 root mem REG 254,0 6672 139586 /usr/lib/collectd/cpu.so collectd 25335 root mem REG 254,0 8040 139515 /usr/lib/collectd/logfile.so collectd 25335 root mem REG 254,0 1375536 262194 /lib/libc-2.7.so collectd 25335 root mem REG 254,0 14616 262198 /lib/libdl-2.7.so collectd 25335 root mem REG 254,0 130114 262190 /lib/libpthread-2.7.so collectd 25335 root mem REG 254,0 119288 262197 /lib/ld-2.7.so collectd 25335 root mem REG 254,0 13168 123277 /usr/lib/libgpg-error.so.0.3.0 collectd 25335 root 0r FIFO 0,6 1285434092 pipe collectd 25335 root 1w FIFO 0,6 1285434093 pipe collectd 25335 root 2w FIFO 0,6 1285434094 pipe Not much useful in there that I can see. > While looking into this I did find a path in "exec_read_one()" where the > function returned without clearing the "PL_RUNNING" flag. I don't see > how this could produce a child process, but maybe it's worth a try. The > commit is 66c0d62 ([0]). > > Regards, > —octo > [0] > <http://github.com/octo/collectd/commit/66c0d62a769d8bb363c8d19e82896d6cf5bdcc2b> I'll give that a shot. Thanks a ton for all your help here. Ryan _______________________________________________ collectd mailing list collectd@verplant.org http://mailman.verplant.org/listinfo/collectd