Hi

We're running Slurm 2.5.4 with Moab 6.1.10 in a single-controller
configuration.  We're hitting incidents where slurmctld pins a CPU at
99% and stops responding to Moab.

I've captured strace and lsof output.  One thread appears to be
consuming the CPU; strace shows:

poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5,
revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd8e, 51059837148)   = -1 EPIPE (Broken pipe)
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5,
revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd8d, 51059837149)   = -1 EPIPE (Broken pipe)
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5,
revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd8c, 51059837150)   = -1 EPIPE (Broken pipe)
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5,
revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd8b, 51059837151)   = -1 EPIPE (Broken pipe)
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5,
revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd8a, 51059837152)   = -1 EPIPE (Broken pipe)
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5,
revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd89, 51059837153)   = -1 EPIPE (Broken pipe)
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5,
revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd88, 51059837154)   = -1 EPIPE (Broken pipe)
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5,
revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd87, 51059837155)   = -1 EPIPE (Broken pipe)
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5,
revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd86, 51059837156)   = -1 EPIPE (Broken pipe)
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5,
revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd85, 51059837157)   = -1 EPIPE (Broken pipe)

lsof shows what I'm guessing are the two connections to Moab:

slurmctld 7440 slurm    0u      CHR       1,3       0t0      4764 /dev/null
slurmctld 7440 slurm    1u      CHR     136,1       0t0         4 /dev/pts/1
slurmctld 7440 slurm    2u      CHR     136,1       0t0         4 /dev/pts/1
slurmctld 7440 slurm    3w      REG       8,1 947861216   1069214 /var/log/slurm-llnl/slurmjobcomp.log
slurmctld 7440 slurm    4wW     REG      0,15         5      9981 /run/slurmctld.pid
slurmctld 7440 slurm    5u     sock       0,7       0t0 542179544 can't identify protocol
slurmctld 7440 slurm    6u     IPv4 533764508       0t0       TCP *:7321 (LISTEN)
slurmctld 7440 slurm    9u     IPv4 533764510       0t0       TCP *:6817 (LISTEN)
slurmctld 7440 slurm   10w      REG       8,1    356488   1061839 /var/log/slurm-llnl/slurmctld.log
slurmctld 7440 slurm   20u     sock       0,7       0t0 533763998 can't identify protocol

The two sockets reported as "can't identify protocol" (fds 5 and 20) are
the ones I suspect are the MWM connections; note that fd 5 is the same
descriptor the spinning thread is writing to.  Restarting slurmctld
clears the condition and restores normal operation.

Any suggestions on diagnosing/debugging this?

Thanks

Michael
