I can tell you there are two pipes used for Slurm/Moab communications. One is used by Moab to get and set information in Slurm. The second is used by Slurm to notify Moab when something of interest happens, say a new job is submitted.
I'm not sure how easy this is to reproduce, but there is a Slurm configuration parameter "DebugFlags=wiki" that will log all communications between Moab and Slurm and may help, although it is very verbose.

Quoting Michael Gutteridge <[email protected]>:

> Hi
>
> We're running Slurm 2.5.4 with Moab 6.1.10 in a single-controller
> configuration.  We're having incidents where slurmctld goes to 99% cpu and
> doesn't respond to Moab.
>
> I've taken straces and lsof.  There's one thread which seems to be
> consuming the CPU- strace shows:
>
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd8e, 51059837148) = -1 EPIPE (Broken pipe)
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd8d, 51059837149) = -1 EPIPE (Broken pipe)
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd8c, 51059837150) = -1 EPIPE (Broken pipe)
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd8b, 51059837151) = -1 EPIPE (Broken pipe)
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd8a, 51059837152) = -1 EPIPE (Broken pipe)
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd89, 51059837153) = -1 EPIPE (Broken pipe)
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd88, 51059837154) = -1 EPIPE (Broken pipe)
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd87, 51059837155) = -1 EPIPE (Broken pipe)
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd86, 51059837156) = -1 EPIPE (Broken pipe)
> poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
> write(5, 0x7f9d34e4bd85, 51059837157) = -1 EPIPE (Broken pipe)
>
> lsof is showing what I'm guessing is two connections to Moab:
>
> slurmctld 7440 slurm    0u   CHR       1,3       0t0      4764 /dev/null
> slurmctld 7440 slurm    1u   CHR     136,1       0t0         4 /dev/pts/1
> slurmctld 7440 slurm    2u   CHR     136,1       0t0         4 /dev/pts/1
> slurmctld 7440 slurm    3w   REG       8,1 947861216   1069214 /var/log/slurm-llnl/slurmjobcomp.log
> slurmctld 7440 slurm    4wW  REG      0,15         5      9981 /run/slurmctld.pid
> slurmctld 7440 slurm    5u  sock       0,7       0t0 542179544 can't identify protocol
> slurmctld 7440 slurm    6u  IPv4 533764508       0t0       TCP *:7321 (LISTEN)
> slurmctld 7440 slurm    9u  IPv4 533764510       0t0       TCP *:6817 (LISTEN)
> slurmctld 7440 slurm   10w   REG       8,1    356488   1061839 /var/log/slurm-llnl/slurmctld.log
> slurmctld 7440 slurm   20u  sock       0,7       0t0 533763998 can't identify protocol
>
> The sockets indicating "can't identify protocol" are the ones I'm thinking
> are the MWM connections.  Restarting clears the condition and restores
> function.
>
> Any suggestions on diagnosing/debugging this?
>
> Thanks
>
> Michael
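
For what it's worth, the pattern in the quoted strace (poll() reporting POLLOUT|POLLHUP, then write() failing with EPIPE) is what a stream socket typically shows once its peer has closed. Below is a minimal Python sketch of that condition, assuming Linux and a local socketpair rather than anything from Slurm or Moab. The names peer_closed_write_demo and write_all are hypothetical and exist only for illustration; this is not the slurmctld code, just a way to see the same poll/write results and one way a write loop can give up instead of retrying.

```python
import errno
import os
import select
import socket


def peer_closed_write_demo():
    """Return (poll_flags, errno) seen when writing to a socket whose peer closed.

    Illustration only: uses a local AF_UNIX socketpair, not a Moab connection.
    """
    ours, peer = socket.socketpair()
    peer.close()  # simulate the other side going away

    poller = select.poll()
    poller.register(ours.fileno(), select.POLLOUT)
    events = poller.poll(1000)
    flags = events[0][1] if events else 0

    try:
        os.write(ours.fileno(), b"x")
        err = 0
    except OSError as exc:  # Python ignores SIGPIPE, so EPIPE surfaces here
        err = exc.errno
    ours.close()
    return flags, err


def write_all(fd, data, timeout_ms=10000):
    """Write all of data to fd, stopping as soon as the peer has gone away."""
    poller = select.poll()
    poller.register(fd, select.POLLOUT)
    sent = 0
    while sent < len(data):
        events = poller.poll(timeout_ms)
        if not events:
            raise TimeoutError("peer not ready for writing")
        flags = events[0][1]
        # A hangup or error will not clear by itself, so retrying only spins.
        if flags & (select.POLLHUP | select.POLLERR | select.POLLNVAL):
            raise BrokenPipeError(errno.EPIPE, "peer closed the connection")
        # os.write raises OSError on failure; it never returns -1.
        sent += os.write(fd, data[sent:])
    return sent


if __name__ == "__main__":
    flags, err = peer_closed_write_demo()
    print("POLLHUP set:", bool(flags & select.POLLHUP))
    print("write errno:", errno.errorcode.get(err, err))
```

In the trace the buffer address drops by one and the length grows by one on every failed call. That would be consistent with a loop that adds the -1 return value to its offset and keeps going, but that is only a guess from the trace and not something checked against the Slurm source.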
