Hi
We're running Slurm 2.5.4 with Moab 6.1.10 in a single-controller
configuration. We're seeing intermittent incidents where slurmctld
goes to 99% CPU and stops responding to Moab.
I've captured strace and lsof output while this is happening. One
thread appears to be consuming the CPU; strace on it shows:
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd8e, 51059837148) = -1 EPIPE (Broken pipe)
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd8d, 51059837149) = -1 EPIPE (Broken pipe)
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd8c, 51059837150) = -1 EPIPE (Broken pipe)
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd8b, 51059837151) = -1 EPIPE (Broken pipe)
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd8a, 51059837152) = -1 EPIPE (Broken pipe)
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd89, 51059837153) = -1 EPIPE (Broken pipe)
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd88, 51059837154) = -1 EPIPE (Broken pipe)
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd87, 51059837155) = -1 EPIPE (Broken pipe)
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd86, 51059837156) = -1 EPIPE (Broken pipe)
poll([{fd=5, events=POLLOUT}], 1, 10000) = 1 ([{fd=5, revents=POLLOUT|POLLHUP}])
write(5, 0x7f9d34e4bd85, 51059837157) = -1 EPIPE (Broken pipe)
lsof shows what I'm guessing are two connections to Moab:
slurmctld 7440 slurm 0u CHR 1,3 0t0 4764 /dev/null
slurmctld 7440 slurm 1u CHR 136,1 0t0 4 /dev/pts/1
slurmctld 7440 slurm 2u CHR 136,1 0t0 4 /dev/pts/1
slurmctld 7440 slurm 3w REG 8,1 947861216 1069214 /var/log/slurm-llnl/slurmjobcomp.log
slurmctld 7440 slurm 4wW REG 0,15 5 9981 /run/slurmctld.pid
slurmctld 7440 slurm 5u sock 0,7 0t0 542179544 can't identify protocol
slurmctld 7440 slurm 6u IPv4 533764508 0t0 TCP *:7321 (LISTEN)
slurmctld 7440 slurm 9u IPv4 533764510 0t0 TCP *:6817 (LISTEN)
slurmctld 7440 slurm 10w REG 8,1 356488 1061839 /var/log/slurm-llnl/slurmctld.log
slurmctld 7440 slurm 20u sock 0,7 0t0 533763998 can't identify protocol
The sockets reported as "can't identify protocol" are the ones I think
are the MWM connections. Restarting slurmctld clears the condition and
restores normal operation.
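My understanding is that lsof prints "can't identify protocol" when it
can't find a socket's inode in any of the /proc/net tables, which would
fit a connection the kernel has already torn down on one side. One
check I plan to run next time is to look the inode from lsof up in
/proc/net/tcp and /proc/net/tcp6 directly; a rough sketch of that
(hypothetical helper, not part of Slurm, inode hard-coded from the fd 5
line above):

/* Look up a socket inode reported by lsof in /proc/net/tcp and
 * /proc/net/tcp6.  If the inode is absent from both, the kernel no
 * longer has TCP state for it, which is consistent with lsof's
 * "can't identify protocol". */
#include <stdio.h>
#include <string.h>

static int inode_listed(const char *path, const char *inode)
{
    char line[512], needle[32];
    FILE *fp = fopen(path, "r");
    int found = 0;

    if (!fp)
        return 0;
    /* the inode is a whitespace-separated decimal field in each entry */
    snprintf(needle, sizeof(needle), " %s ", inode);
    while (fgets(line, sizeof(line), fp))
        if (strstr(line, needle))
            found = 1;
    fclose(fp);
    return found;
}

int main(void)
{
    const char *inode = "542179544";  /* fd 5 in the lsof output above */

    printf("tcp : %s\n", inode_listed("/proc/net/tcp",  inode) ? "present" : "absent");
    printf("tcp6: %s\n", inode_listed("/proc/net/tcp6", inode) ? "present" : "absent");
    return 0;
}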
Any suggestions on diagnosing/debugging this?
Thanks
Michael