[slurm-dev] Re: slurm daemon not running

Moe Jette Wed, 18 Dec 2013 09:21:24 -0800

Are the node names in the log message really the same: "(clus201 rather than clus201)"?

The error is produced because a string compare fails, so I would guess there is some non-printable character in a node name. Perhaps look at the log using "od -ax" for the non-printable character. The relevant code is in src/slurmctld/proc_req.c (you can ignore the comment since the node names appear to be the same):


(line 1762):
        if (job_ptr && job_ptr->batch_host && comp_msg->node_name &&
            strcmp(job_ptr->batch_host, comp_msg->node_name)) {
                /* This can be the result of the slurmd on the batch_host
                 * failing, but the slurmstepd continuing to run. Then the
                 * batch job is requeued and started on a different node.
                 * The end result is one batch complete RPC from each node. */
                error("Batch completion for job %u sent from wrong node "
                      "(%s rather than %s), ignored request",
                      comp_msg->job_id,
                      comp_msg->node_name, comp_msg->node_name);
                slurm_send_rc_msg(msg, error_code);
                return;
        }
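To illustrate the guess above, here is a minimal sketch (not Slurm code) of how two node names can print identically and still fail strcmp(). The helpers names_differ() and escape_name() are hypothetical names made up for this example, and the trailing carriage return mentioned below is only an assumed stand-in for whatever stray byte might be present.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Slurm): nonzero when the two names
 * differ byte for byte, which is the same test strcmp() applies in the
 * proc_req.c check. */
static int names_differ(const char *a, const char *b)
{
        return strcmp(a, b) != 0;
}

/* Hypothetical helper (not part of Slurm): copy `s` into `out`, rendering
 * every non-printable byte as \xNN so hidden characters become visible.
 * Output is truncated if `out` is too small. Returns `out`. */
static char *escape_name(const char *s, char *out, size_t out_len)
{
        size_t used = 0;

        assert(s != NULL && out != NULL && out_len > 0);
        out[0] = '\0';
        for (; *s != '\0'; s++) {
                unsigned char c = (unsigned char)*s;

                /* Worst case is 4 bytes ("\xNN") plus the trailing NUL. */
                if (used + 5 > out_len)
                        break;
                if (isprint(c))
                        used += (size_t)snprintf(out + used, out_len - used,
                                                 "%c", c);
                else
                        used += (size_t)snprintf(out + used, out_len - used,
                                                 "\\x%02x", c);
        }
        return out;
}
```

For example, names_differ("clus201", "clus201\r") is nonzero even though both strings look like "clus201" in a log, and escape_name() renders the second one as clus201\x0d.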

Quoting Dr Stuart Midgley <[email protected]>:

After about 6 hours of operation, our slurmctld hangs up…


spool# ps -ef | grep slurm
root      7511  4773  0 23:56 pts/7    00:00:00 grep slurm
slurm    13841     1  0 14:22 ?        00:01:28 /d/sw/slurm/latest/sbin/slurmctld
spool# scontrol ping
Slurmctld(primary/backup) at bud30/(NULL) are DOWN/DOWN
*****************************************
** RESTORE SLURMCTLD DAEMON TO SERVICE **
*****************************************


In /var/log/messages I see
slurmctld[13841]: error: Batch completion for job 5651 sent from wrong node (clus201 rather than clus201), ignored request
and then a whole heap of

error: slurm_receive_msgs: Socket timed out on send/recv operation
and then nothing. I’ve increased the log level to 9… and will see what happens.
I’m running from git master at commit 6cc88535d7369a9eaacd36949e5241729461eaa2
Thanks
Stu.


--
Dr Stuart Midgley
[email protected]
