Stu, if/when this happens again, could you please attach to the hung
slurmctld with gdb and run
thread apply all bt full
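If it is easier to capture non-interactively, something like the sketch below should work. It is only an illustration: the dump_backtraces helper, the GDB variable and the /tmp output path are names I made up for this example, and it assumes gdb is installed and that you run it as root or as the slurm user.

```shell
#!/bin/sh
# Sketch: dump full backtraces of every thread in a hung slurmctld.
# GDB and the default output path are illustrative, not Slurm conventions.
GDB="${GDB:-gdb}"

dump_backtraces() {
    pid="$1"
    out="${2:-/tmp/slurmctld-bt.$pid.txt}"
    # -p attaches to the running process; -batch makes gdb exit (and detach)
    # once the -ex commands have run, so the daemon is only paused briefly.
    "$GDB" -p "$pid" -batch \
        -ex "set pagination off" \
        -ex "thread apply all bt full" > "$out" 2>&1
    # Print the path of the file holding the backtraces.
    echo "$out"
}

# Usage, assuming exactly one slurmctld process is running:
#   dump_backtraces "$(pgrep -x slurmctld)"
```

The resulting file can then be attached to a reply. Backtraces are much more useful if slurmctld was built with debug symbols.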
Michael is correct that the debug message is incorrect. It is now fixed
in the master branch.
Thanks,
Danny
On 12/18/13 08:01, Dr Stuart Midgley wrote:
After about 6 hours of operation, our slurmctld hangs up…
spool# ps -ef | grep slurm
root 7511 4773 0 23:56 pts/7 00:00:00 grep slurm
slurm 13841 1 0 14:22 ? 00:01:28 /d/sw/slurm/latest/sbin/slurmctld
spool# scontrol ping
Slurmctld(primary/backup) at bud30/(NULL) are DOWN/DOWN
*****************************************
** RESTORE SLURMCTLD DAEMON TO SERVICE **
*****************************************
In /var/log/messages I see
slurmctld[13841]: error: Batch completion for job 5651 sent from wrong node (clus201 rather than clus201), ignored request
and then a whole heap of
error: slurm_receive_msgs: Socket timed out on send/recv operation
and then nothing. I’ve increased the log level to 9… and will see what happens.
I’m running from git master at commit 6cc88535d7369a9eaacd36949e5241729461eaa2
Thanks
Stu.
--
Dr Stuart Midgley