On Wednesday, 18 December 2013, at 09:20:38 (-0800),
Moe Jette wrote:

> 
> Are the node names in the log message really the same: "(clus201
> rather than clus201)"?
> 
> The error is produced because of a string compare failing, so I
> would guess there is some non-printable character in a node name.
> Perhaps look at the log using "od -ax" for the non-printable
> character. The relevant code is in src/slurmctld/proc_req.c (you can
> ignore the comment since the node names appear to be the same):
> 
> (line 1762):
>       if (job_ptr && job_ptr->batch_host && comp_msg->node_name &&
>           strcmp(job_ptr->batch_host, comp_msg->node_name)) {
>               /* This can be the result of the slurmd on the batch_host
>                * failing, but the slurmstepd continuing to run. Then the
>                * batch job is requeued and started on a different node.
>                * The end result is one batch complete RPC from each node. */
>               error("Batch completion for job %u sent from wrong node "
>                     "(%s rather than %s), ignored request",
>                     comp_msg->job_id,
>                     comp_msg->node_name, comp_msg->node_name);
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I don't think the error message is helpful here: both %s arguments
are comp_msg->node_name, so it prints the same name twice no matter
what job_ptr->batch_host actually contains.
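
For illustration, here is a minimal standalone sketch of what the message
would presumably look like with the second argument changed to
job_ptr->batch_host. That change is my assumption about the intent, not a
tested patch. The job_stub and comp_stub structs and the wrong_node_msg()
helper are hypothetical stand-ins so the snippet compiles outside of
Slurm, and snprintf() stands in for error().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the fields used in proc_req.c; the real
 * Slurm structures carry many more members. */
struct job_stub  { const char *batch_host; };
struct comp_stub { unsigned int job_id; const char *node_name; };

/* Returns 1 and fills buf with the diagnostic when the completion RPC
 * came from a node other than the job's batch host; returns 0 and
 * leaves buf empty otherwise. */
static int wrong_node_msg(const struct job_stub *job_ptr,
                          const struct comp_stub *comp_msg,
                          char *buf, size_t len)
{
        assert(buf != NULL && len > 0);
        buf[0] = '\0';

        if (job_ptr && job_ptr->batch_host && comp_msg->node_name &&
            strcmp(job_ptr->batch_host, comp_msg->node_name)) {
                snprintf(buf, len,
                         "Batch completion for job %u sent from wrong node "
                         "(%s rather than %s), ignored request",
                         comp_msg->job_id,
                         comp_msg->node_name,   /* node that sent the RPC */
                         job_ptr->batch_host);  /* assumed intended value */
                return 1;
        }
        return 0;
}
```

With a made-up batch_host of "clus202" and a node_name of "clus201", this
reports "(clus201 rather than clus202)", which would show right away
whether the two strings really differ.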

Michael

-- 
Michael Jennings <[email protected]>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E        W: 510-495-2687
MS 050B-3209          F: 510-486-8615
