After about 6 hours of operation, our slurmctld hangs…

spool# ps -ef | grep slurm
root      7511  4773  0 23:56 pts/7    00:00:00 grep slurm
slurm    13841     1  0 14:22 ?        00:01:28 /d/sw/slurm/latest/sbin/slurmctld

spool# scontrol ping
Slurmctld(primary/backup) at bud30/(NULL) are DOWN/DOWN
*****************************************
** RESTORE SLURMCTLD DAEMON TO SERVICE **
*****************************************


In /var/log/messages I see

slurmctld[13841]: error: Batch completion for job 5651 sent from wrong node 
(clus201 rather than clus201), ignored request

and then a whole heap of

error: slurm_receive_msgs: Socket timed out on send/recv operation


and then nothing.  I've increased the log level to 9 and will see what happens.
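
For reference, roughly how the level was bumped (assuming slurm.conf lives under the same /d/sw/slurm/latest prefix as the daemon, and using the numeric 0-9 SlurmctldDebug syntax; scontrol setdebug accepts the same values, but only helps while slurmctld is still answering RPCs):

spool# grep SlurmctldDebug /d/sw/slurm/latest/etc/slurm.conf   # config path assumed from the install prefix
SlurmctldDebug=9
spool# scontrol setdebug 9   # alternative for a running, responsive slurmctld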

I’m running from git master at commit 6cc88535d7369a9eaacd36949e5241729461eaa2

Thanks
Stu.


--
Dr Stuart Midgley
[email protected]

