After about 6 hours of operation, our slurmctld hangs up…
spool# ps -ef | grep slurm
root      7511  4773  0 23:56 pts/7    00:00:00 grep slurm
slurm    13841     1  0 14:22 ?        00:01:28 /d/sw/slurm/latest/sbin/slurmctld

spool# scontrol ping
Slurmctld(primary/backup) at bud30/(NULL) are DOWN/DOWN
*****************************************
** RESTORE SLURMCTLD DAEMON TO SERVICE **
*****************************************

In /var/log/messages I see

slurmctld[13841]: error: Batch completion for job 5651 sent from wrong node (clus201 rather than clus201), ignored request

and then a whole heap of

error: slurm_receive_msgs: Socket timed out on send/recv operation

and then nothing.

I've increased the log level to 9… and will see what happens.

I'm running from git master at commit 6cc88535d7369a9eaacd36949e5241729461eaa2

Thanks
Stu.

--
Dr Stuart Midgley
[email protected]
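[Editor's note: a minimal sketch of how the slurmctld log level can be raised to 9, as the poster describes. It assumes scontrol is on PATH and slurm.conf is in its default location; whether numeric levels or names (debug, debug2…debug5) are accepted, and the log file path shown, depend on the Slurm version and site config.]

    # Bump the controller's verbosity at runtime (no daemon restart needed,
    # but the daemon must be up and responding):
    spool# scontrol setdebug 9

    # Or make it persistent in slurm.conf and push the change out
    # (the SlurmctldLogFile path below is only an example):
    #   SlurmctldDebug=9
    #   SlurmctldLogFile=/var/log/slurm/slurmctld.log
    spool# scontrol reconfigure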
