There are two pipes used for Slurm/Moab communication. Moab uses one
to get and set information in Slurm. Slurm uses the other to notify
Moab when something of interest happens, such as the submission of a
new job.

I'm not sure how easy this is to reproduce, but the Slurm
configuration parameter "DebugFlags=wiki" logs all communication
between Moab and Slurm and may help, although the output is very
verbose.
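
One possible reading of the strace output quoted below, offered as a
guess rather than a confirmed diagnosis: poll() returns immediately
with POLLHUP set, and each failed write() is followed by another whose
buffer address is one byte lower and whose length is one byte larger.
That pattern is consistent with a send loop that folds the -1 return
value into its pointer and remaining-length bookkeeping instead of
stopping on the error. The sketch below is not Slurm code. It is
written in Python for brevity, and the names send_all and timeout_ms
are hypothetical. It only illustrates a loop that gives up once the
peer has gone away.

```python
import os
import select


def send_all(fd, data, timeout_ms=10000):
    """Write data to fd and return the number of bytes actually written.

    Stops early if the peer has hung up, the descriptor reports an
    error, or the poll times out, instead of retrying forever.
    """
    poller = select.poll()
    poller.register(fd, select.POLLOUT)
    sent = 0
    while sent < len(data):
        events = poller.poll(timeout_ms)
        if not events:
            break  # timed out waiting for the descriptor to become writable
        _, revents = events[0]
        if revents & (select.POLLHUP | select.POLLERR):
            break  # peer closed the connection; further writes cannot succeed
        try:
            written = os.write(fd, data[sent:])
        except BrokenPipeError:
            break  # EPIPE: treat as fatal rather than adjusting the offset
        sent += written  # only advance on a successful write
    return sent
```

A caller would compare the return value with len(data) and, on a short
count, close the descriptor and wait for the other side to reconnect.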

Quoting Michael Gutteridge <[email protected]>:

> Hi
>
> We're running Slurm 2.5.4 with Moab 6.1.10 in a single-controller
> configuration.  We're having incidents where slurmctld goes to 99% cpu and
> doesn't respond to Moab.
>
> I've taken straces and lsof output.  There's one thread which seems to be
> consuming the CPU; strace shows:
>
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd8e, 51059837148)   = -1 EPIPE (Broken pipe)
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd8d, 51059837149)   = -1 EPIPE (Broken pipe)
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd8c, 51059837150)   = -1 EPIPE (Broken pipe)
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd8b, 51059837151)   = -1 EPIPE (Broken pipe)
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd8a, 51059837152)   = -1 EPIPE (Broken pipe)
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd89, 51059837153)   = -1 EPIPE (Broken pipe)
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd88, 51059837154)   = -1 EPIPE (Broken pipe)
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd87, 51059837155)   = -1 EPIPE (Broken pipe)
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd86, 51059837156)   = -1 EPIPE (Broken pipe)
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd85, 51059837157)   = -1 EPIPE (Broken pipe)
>
> lsof is showing what I'm guessing is two connections to Moab:
>
> slurmctld 7440 slurm    0u      CHR       1,3       0t0      4764 /dev/null
> slurmctld 7440 slurm    1u      CHR     136,1       0t0         4 /dev/pts/1
> slurmctld 7440 slurm    2u      CHR     136,1       0t0         4 /dev/pts/1
> slurmctld 7440 slurm    3w      REG       8,1 947861216   1069214 /var/log/slurm-llnl/slurmjobcomp.log
> slurmctld 7440 slurm    4wW     REG      0,15         5      9981 /run/slurmctld.pid
> slurmctld 7440 slurm    5u     sock       0,7       0t0 542179544 can't identify protocol
> slurmctld 7440 slurm    6u     IPv4 533764508       0t0       TCP *:7321 (LISTEN)
> slurmctld 7440 slurm    9u     IPv4 533764510       0t0       TCP *:6817 (LISTEN)
> slurmctld 7440 slurm   10w      REG       8,1    356488   1061839 /var/log/slurm-llnl/slurmctld.log
> slurmctld 7440 slurm   20u     sock       0,7       0t0 533763998 can't identify protocol
>
> The sockets indicating "can't identify protocol" are the ones I'm thinking
> are the MWM connections.  Restarting clears the condition and restores
> function.
>
> Any suggestions on diagnosing/debugging this?
>
> Thanks
>
> Michael
>
