Re: [collectd] exec plugin stuck on mutex

2010-03-25 Thread Florian Forster
Hi Ryan,

sorry for the late reply, I lost track :/

On Tue, Mar 09, 2010 at 12:24:20PM -0800, Ryan Tomayko wrote:
  such a weird child. Maybe the signal mask or UID/GID can tell me
  which step would be the next and might be the failing one.
 
 You got it:
 
 # cat /proc/25335/status
 […]
 SigBlk: 

Looks like the block mask was already reset. That's the last thing
that's done before calling exec_child.

 […]
 Uid:0   0   0   0
 Gid:0   0   0   0
 Groups: 0

Okay, so neither the supplementary groups, the GID nor the UID have been
set. This makes it likely that one of the following calls is causing
the observed behavior:

  * getpwnam_r
  * getgrnam_r (only if a group was configured)
  * setgroups

How do you do authentication on that machine? Are you using LDAP for
example?

I think the best way to know for sure where the process is hanging is by
using a version with debugging symbols available (the Debian package
provides those, for example, but you can also compile with
--enable-debug) and attach a debugger to the process once it's hanging.

Regards,
—octo
-- 
Florian octo Forster
Hacker in training
GnuPG: 0x91523C3D
http://verplant.org/


signature.asc
Description: Digital signature
___
collectd mailing list
collectd@verplant.org
http://mailman.verplant.org/listinfo/collectd


Re: [collectd] exec plugin stuck on mutex

2010-03-09 Thread Florian Forster
Hi Ryan,

On Tue, Mar 02, 2010 at 08:27:27PM -0800, Ryan Tomayko wrote:
 collectd 4.8.1, http://collectd.org/

no applicable problem has been fixed in the Exec plugin in the meantime,
so the problem should still exist in the master branch.

 Once I notice the plugin has stopped reporting, I have an extra
 process (28489) hanging around:
 
 $ pstree -apu 22935
 collectdmon,22935 -P /var/run/collectdmon.pid -- -C
 /etc/collectd/collectd.conf
   collectd,22936 -C /etc/collectd/collectd.conf -f
   collectd,28489 -C /etc/collectd/collectd.conf -f
   {collectd},22937
   {collectd},22938
   {collectd},22939
   {collectd},22940
   {collectd},22941
   {collectd},28487

 That process seems to exist only when the exec plugin is no longer
 reporting. Sometimes there's two of these processes.

This looks like the code that is supposed to spawn a new instance of the
script failed after fork(2) but before exec(2).

There are various cases in which the exec(2) is not reached in
exec_child(), but they all emit an error message. I take it there is
no error message somewhere in the logs or in syslog?

 strace reports that the extra process is sitting in a mutex. It never
 leaves this state:
 
 $ sudo strace -p 28489
 Process 28489 attached - interrupt to quit
 futex(0x7f2f7d4e8fb0, FUTEX_WAIT_PRIVATE, 2, NULL

There is a mutex in the exec plugin, but I doubt that this is the
problem. It is held just before a thread is spawned (to set a flag) and
just before that thread exits (to reset the flag). I don't see any way
this could lead to a deadlock or starvation.

I'm much more concerned about the SIGCHLD handler and the various
waitpid(2)s in the code. I could see the controlling thread missing its
child's signal and waiting forever. This shouldn't create weird new
processes though.

 Any ideas what might be going on here or information I could provide
 to help find a root cause?

I'm a bit puzzled by the described behavior, I have to admit. Maybe you
could provide the lsof -p $PID output for one of those weird child
processes?

While looking into this I did find a path in exec_read_one() where the
function returned without clearing the PL_RUNNING flag. I don't see
how this could produce a child process, but maybe it's worth a try. The
commit is 66c0d62 ([0]).

Regards,
—octo

[0] 
http://github.com/octo/collectd/commit/66c0d62a769d8bb363c8d19e82896d6cf5bdcc2b
-- 
Florian octo Forster
Hacker in training
GnuPG: 0x91523C3D
http://verplant.org/


signature.asc
Description: Digital signature
___
collectd mailing list
collectd@verplant.org
http://mailman.verplant.org/listinfo/collectd


Re: [collectd] exec plugin stuck on mutex

2010-03-09 Thread Ryan Tomayko
On Tue, Mar 9, 2010 at 10:03 AM, Florian Forster o...@verplant.org wrote:
 Hi Ryan,

 On Tue, Mar 02, 2010 at 08:27:27PM -0800, Ryan Tomayko wrote:
     collectd 4.8.1, http://collectd.org/

 no applicable problem has been fixed in the Exec plugin in the meantime,
 so the problem should still exist in the master branch.

 Once I notice the plugin has stopped reporting, I have an extra
 process (28489) hanging around:

     $ pstree -apu 22935
     collectdmon,22935 -P /var/run/collectdmon.pid -- -C
 /etc/collectd/collectd.conf
       collectd,22936 -C /etc/collectd/collectd.conf -f
           collectd,28489 -C /etc/collectd/collectd.conf -f
           {collectd},22937
           {collectd},22938
           {collectd},22939
           {collectd},22940
           {collectd},22941
           {collectd},28487

 That process seems to exist only when the exec plugin is no longer
 reporting. Sometimes there's two of these processes.

 This looks like the code that is supposed to spawn a new instance of the
 script failed after fork(2) but before exec(2).

 There are various cases in which the exec(2) is not reached in
 exec_child(), but they all emit an error message. I take it there is
 no error message somewhere in the logs or in syslog?

I'm embarassed to admit I have not had the logs turned on. There's
very likely an error message I'm not seeing.  I'll switch them on
today and see what happens.

 strace reports that the extra process is sitting in a mutex. It never
 leaves this state:

     $ sudo strace -p 28489
     Process 28489 attached - interrupt to quit
     futex(0x7f2f7d4e8fb0, FUTEX_WAIT_PRIVATE, 2, NULL

 There is a mutex in the exec plugin, but I doubt that this is the
 problem. It is held just before a thread is spawned (to set a flag) and
 just before that thread exits (to reset the flag). I don't see any way
 this could lead to a deadlock or starvation.

 I'm much more concerned about the SIGCHLD handler and the various
 waitpid(2)s in the code. I could see the controlling thread missing its
 child's signal and waiting forever. This shouldn't create weird new
 processes though.

 Any ideas what might be going on here or information I could provide
 to help find a root cause?

 I'm a bit puzzled by the described behavior, I have to admit. Maybe you
 could provide the lsof -p $PID output for one of those weird child
 processes?

Here's the lsof output:

# lsof -p 25335
COMMANDPID USER   FD   TYPE DEVICESIZE   NODE NAME
collectd 25335 root  cwdDIR  254,04096 650195 /var/lib/collectd
collectd 25335 root  rtdDIR  254,04096  2 /
collectd 25335 root  txtREG  254,0  155960 114923 /usr/sbin/collectd
collectd 25335 root  memREG  254,0   93016 262221 /lib/libgcc_s.so.1
collectd 25335 root  memREG  254,0   72568 262188 /lib/libresolv-2.7.so
collectd 25335 root  memREG  254,0   18752 262195 /lib/libnss_dns-2.7.so
collectd 25335 root  memREG  254,0   47520 262191
/lib/libnss_files-2.7.so
collectd 25335 root  memREG  254,0  421512 122929
/usr/lib/libgcrypt.so.11.4.4
collectd 25335 root  memREG  254,0   40560 139512
/usr/lib/collectd/network.so
collectd 25335 root  memREG  254,06624 139559
/usr/lib/collectd/memory.so
collectd 25335 root  memREG  254,0   12424 139514
/usr/lib/collectd/memcached.so
collectd 25335 root  memREG  254,05288 139551
/usr/lib/collectd/load.so
collectd 25335 root  memREG  254,07208 139530
/usr/lib/collectd/interface.so
collectd 25335 root  memREG  254,0   22128 139513
/usr/lib/collectd/exec.so
collectd 25335 root  memREG  254,0   13344 139548
/usr/lib/collectd/df.so
collectd 25335 root  memREG  254,06672 139586
/usr/lib/collectd/cpu.so
collectd 25335 root  memREG  254,08040 139515
/usr/lib/collectd/logfile.so
collectd 25335 root  memREG  254,0 1375536 262194 /lib/libc-2.7.so
collectd 25335 root  memREG  254,0   14616 262198 /lib/libdl-2.7.so
collectd 25335 root  memREG  254,0  130114 262190 /lib/libpthread-2.7.so
collectd 25335 root  memREG  254,0  119288 262197 /lib/ld-2.7.so
collectd 25335 root  memREG  254,0   13168 123277
/usr/lib/libgpg-error.so.0.3.0
collectd 25335 root0r  FIFO0,6 1285434092 pipe
collectd 25335 root1w  FIFO0,6 1285434093 pipe
collectd 25335 root2w  FIFO0,6 1285434094 pipe

Not much useful in there that I can see.

 While looking into this I did find a path in exec_read_one() where the
 function returned without clearing the PL_RUNNING flag. I don't see
 how this could produce a child process, but maybe it's worth a try. The
 commit is 66c0d62 ([0]).

 Regards,
 —octo
 [0] 
 http://github.com/octo/collectd/commit/66c0d62a769d8bb363c8d19e82896d6cf5bdcc2b

I'll give that a shot. Thanks a ton for all your help here.

Ryan

___
collectd mailing list

Re: [collectd] exec plugin stuck on mutex

2010-03-09 Thread Florian Forster
Hi again,

On Tue, Mar 09, 2010 at 11:46:20AM -0800, Ryan Tomayko wrote:
 # lsof -p 25335
 […]
 collectd 25335 root0r  FIFO0,6 1285434092 pipe
 collectd 25335 root1w  FIFO0,6 1285434093 pipe
 collectd 25335 root2w  FIFO0,6 1285434094 pipe
 
 Not much useful in there that I can see.

actually, this proves that this is in fact the forked child. Since the
network socket is not open, it means that the filehandles (except the
standard ones) have been closed and the pipes have been dup'ed to
standard input, output and error.

The next interesting bit would be the content of /proc/$PID/status for
such a weird child. Maybe the signal mask or UID/GID can tell me which
step would be the next and might be the failing one.

 I'll give that a shot. Thanks a ton for all your help here.

You're welcome ;) I'm astonished again and again how hard forking a
process right actually is ;)

Regards,
—octo
-- 
Florian octo Forster
Hacker in training
GnuPG: 0x91523C3D
http://verplant.org/


signature.asc
Description: Digital signature
___
collectd mailing list
collectd@verplant.org
http://mailman.verplant.org/listinfo/collectd