Are the node names in the log message really the same: "(clus201 rather than clus201)"?

The error is logged when strcmp() reports a mismatch between the two node names, so my guess is that one of them contains a non-printable character. Try dumping the log with "od -ax" to find it. The relevant code is in src/slurmctld/proc_req.c (you can ignore the comment, since the node names appear to be the same):

(line 1762):
        if (job_ptr && job_ptr->batch_host && comp_msg->node_name &&
            strcmp(job_ptr->batch_host, comp_msg->node_name)) {
                /* This can be the result of the slurmd on the batch_host
                 * failing, but the slurmstepd continuing to run. Then the
                 * batch job is requeued and started on a different node.
                 * The end result is one batch complete RPC from each node. */
                error("Batch completion for job %u sent from wrong node "
                      "(%s rather than %s), ignored request",
                      comp_msg->job_id,
                      comp_msg->node_name, comp_msg->node_name);
                slurm_send_rc_msg(msg, error_code);
                return;
        }
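
As a rough sketch of how a hidden character could be exposed, the helpers below dump a node name byte by byte. The function names first_suspect_byte() and dump_name_hex() are hypothetical and not part of Slurm. Note that the error() call as quoted prints comp_msg->node_name for both %s placeholders, so the log line alone cannot show what job_ptr->batch_host contained; dumping both strings separately may therefore help.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of Slurm: return the index of the first
 * byte in s that isprint() rejects in the current locale, or -1 if every
 * byte is printable. */
static int first_suspect_byte(const char *s)
{
        for (size_t i = 0; s[i] != '\0'; i++) {
                if (!isprint((unsigned char)s[i]))
                        return (int)i;
        }
        return -1;
}

/* Hypothetical helper, not part of Slurm: write name to out as
 * space-separated hex bytes so that hidden characters (a trailing '\n',
 * a '\r', other control codes) become visible. */
static void dump_name_hex(FILE *out, const char *label, const char *name)
{
        fprintf(out, "%s (len %zu):", label, strlen(name));
        for (size_t i = 0; name[i] != '\0'; i++)
                fprintf(out, " %02x", (unsigned char)name[i]);
        fputc('\n', out);
}
```

One possible use, assuming a temporary debugging patch is acceptable, would be to call dump_name_hex() on job_ptr->batch_host and on comp_msg->node_name just before the error() call and compare the two lines of output.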

Quoting Dr Stuart Midgley <[email protected]>:


After about 6 hours of operation, our slurmctld hangs up…


spool# ps -ef | grep slurm
root      7511  4773  0 23:56 pts/7    00:00:00 grep slurm
slurm 13841 1 0 14:22 ? 00:01:28 /d/sw/slurm/latest/sbin/slurmctld

spool# scontrol ping
Slurmctld(primary/backup) at bud30/(NULL) are DOWN/DOWN
*****************************************
** RESTORE SLURMCTLD DAEMON TO SERVICE **
*****************************************


In /var/log/messages I see

slurmctld[13841]: error: Batch completion for job 5651 sent from wrong node (clus201 rather than clus201), ignored request

and then a whole heap of

error: slurm_receive_msgs: Socket timed out on send/recv operation


and then nothing. I’ve increased the log level to 9… and will see what happens.

I’m running from git master at commit 6cc88535d7369a9eaacd36949e5241729461eaa2

Thanks
Stu.


--
Dr Stuart Midgley
[email protected]



