Stu, if/when this happens again, could you please attach to slurmctld with gdb and run:

thread apply all bt full
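
For capturing the same backtraces without an interactive session, a minimal sketch follows. It assumes gdb is installed on the controller host and is run as root or the slurm user; the helper name `slurmctld_bt_cmd` is hypothetical, and the PID is the one from the ps output in the original report. The helper only prints the command so it can be checked before it is run.

```shell
# Hypothetical helper: print the gdb command that attaches to a PID,
# dumps full backtraces for every thread, and then exits.
# -p attaches to the process, -batch exits after the commands finish,
# -ex runs the given gdb command.
slurmctld_bt_cmd() {
    pid="$1"
    printf 'gdb -p %s -batch -ex "thread apply all bt full"\n' "$pid"
}

# 13841 is the slurmctld PID from the report; substitute the current one.
slurmctld_bt_cmd 13841
```

Redirecting the output of the printed command to a file makes it easier to attach to a reply.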

Michael is correct that the debug message is incorrect. It is now fixed in the master branch.

Thanks,
Danny

On 12/18/13 08:01, Dr Stuart Midgley wrote:
After about 6 hours of operation, our slurmctld hangs up…


spool# ps -ef | grep slurm
root      7511  4773  0 23:56 pts/7    00:00:00 grep slurm
slurm    13841     1  0 14:22 ?        00:01:28 /d/sw/slurm/latest/sbin/slurmctld

spool# scontrol ping
Slurmctld(primary/backup) at bud30/(NULL) are DOWN/DOWN
*****************************************
** RESTORE SLURMCTLD DAEMON TO SERVICE **
*****************************************


In /var/log/messages I see

slurmctld[13841]: error: Batch completion for job 5651 sent from wrong node (clus201 rather than clus201), ignored request

and then a whole heap of

error: slurm_receive_msgs: Socket timed out on send/recv operation


and then nothing.  I’ve increased the log level to 9… and will see what happens.

I’m running from git master at commit 6cc88535d7369a9eaacd36949e5241729461eaa2

Thanks
Stu.


--
Dr Stuart Midgley

