Are the node names in the log message really the same: "(clus201
rather than clus201)"?
The error is produced when a string comparison (strcmp) fails, so my
guess is that one of the node names contains a non-printable
character. Try dumping the log with "od -ax" to look for it. The
relevant code is in src/slurmctld/proc_req.c (you can ignore the
comment, since the node names appear to be the same):
(line 1762):
    if (job_ptr && job_ptr->batch_host && comp_msg->node_name &&
        strcmp(job_ptr->batch_host, comp_msg->node_name)) {
        /* This can be the result of the slurmd on the batch_host
         * failing, but the slurmstepd continuing to run. Then the
         * batch job is requeued and started on a different node.
         * The end result is one batch complete RPC from each node. */
        error("Batch completion for job %u sent from wrong node "
              "(%s rather than %s), ignored request",
              comp_msg->job_id,
              comp_msg->node_name, comp_msg->node_name);
        slurm_send_rc_msg(msg, error_code);
        return;
    }
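
If it helps to see what a hidden character would look like, here is a
minimal sketch of a helper that rewrites a node name with every
non-printable byte shown as \xNN. The name escape_name() is
hypothetical and is not part of Slurm; it only illustrates the idea
behind checking the log with "od -ax".

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Slurm): copy src into dst, writing
 * every non-printable byte as \xNN so hidden characters become visible.
 * Output is truncated if dst is too small; dst is always NUL-terminated. */
void escape_name(const char *src, char *dst, size_t dst_len)
{
	size_t used = 0;

	assert(src != NULL && dst != NULL && dst_len > 0);
	dst[0] = '\0';
	for (; *src != '\0'; src++) {
		unsigned char c = (unsigned char) *src;

		/* Worst case per byte: 4 characters ("\xNN") plus the NUL */
		if (used + 5 > dst_len)
			break;
		if (isprint(c))
			used += (size_t) snprintf(dst + used, dst_len - used,
						  "%c", c);
		else
			used += (size_t) snprintf(dst + used, dst_len - used,
						  "\\x%02x", c);
	}
}
```

With made-up inputs, escape_name("clus201\r", buf, sizeof(buf)) would
leave "clus201\x0d" in buf, while strcmp("clus201", "clus201\r")
returns non-zero even though both names look identical when printed
raw. Note also that, in the snippet as quoted above, both %s arguments
are comp_msg->node_name, so the message would show the same name twice
even if job_ptr->batch_host were different; printing
job_ptr->batch_host as the second argument would show the value that
is actually being compared.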
Quoting Dr Stuart Midgley <[email protected]>:
After about 6 hours of operation, our slurmctld hangs up…
spool# ps -ef | grep slurm
root 7511 4773 0 23:56 pts/7 00:00:00 grep slurm
slurm 13841 1 0 14:22 ? 00:01:28 /d/sw/slurm/latest/sbin/slurmctld
spool# scontrol ping
Slurmctld(primary/backup) at bud30/(NULL) are DOWN/DOWN
*****************************************
** RESTORE SLURMCTLD DAEMON TO SERVICE **
*****************************************
In /var/log/messages I see
slurmctld[13841]: error: Batch completion for job 5651 sent from
wrong node (clus201 rather than clus201), ignored request
and then a whole heap of
error: slurm_receive_msgs: Socket timed out on send/recv operation
and then nothing. I’ve increased the log level to 9… and will see
what happens.
I’m running from git master at commit
6cc88535d7369a9eaacd36949e5241729461eaa2
Thanks
Stu.
--
Dr Stuart Midgley
[email protected]