On Wednesday, 18 December 2013, at 09:20:38 (-0800),
Moe Jette wrote:
>
> Are the node names in the log message really the same: "(clus201
> rather than clus201)"?
>
> The error is produced because of a string compare failing, so I
> would guess there is some non-printable character in a node name.
> Perhaps look at the log using "od -ax" for the non-printable
> character. The relevant code is in src/slurmctld/proc_req.c (you can
> ignore the comment since the node names appear to be the same):
>
> (line 1762):
> if (job_ptr && job_ptr->batch_host && comp_msg->node_name &&
>     strcmp(job_ptr->batch_host, comp_msg->node_name)) {
>         /* This can be the result of the slurmd on the batch_host
>          * failing, but the slurmstepd continuing to run. Then the
>          * batch job is requeued and started on a different node.
>          * The end result is one batch complete RPC from each node. */
>         error("Batch completion for job %u sent from wrong node "
>               "(%s rather than %s), ignored request",
>               comp_msg->job_id,
>               comp_msg->node_name, comp_msg->node_name);
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
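I don't think the error message is helpful here, since it prints
comp_msg->node_name twice. Given the strcmp() above, the second
argument was presumably meant to be job_ptr->batch_host, which would
also explain why both names in the log look identical.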
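
For illustration, here is a minimal standalone sketch of what I would
expect the corrected call to look like. It is not a patch against the
actual tree: the demo_job and demo_comp_msg structs and the
wrong_node_msg() helper are hypothetical stand-ins that carry only the
fields used above, and snprintf() stands in for slurm's error().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the slurmctld job and RPC structures. */
struct demo_job {
    const char *batch_host;
};

struct demo_comp_msg {
    unsigned int job_id;
    const char *node_name;
};

/*
 * Same test as the quoted code. On a mismatch, format the message into
 * buf and return 1; otherwise leave buf untouched and return 0.
 */
int wrong_node_msg(const struct demo_job *job_ptr,
                   const struct demo_comp_msg *comp_msg,
                   char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    if (job_ptr && job_ptr->batch_host && comp_msg->node_name &&
        strcmp(job_ptr->batch_host, comp_msg->node_name)) {
        /* Sender first, then the node the job was expected on. */
        snprintf(buf, len,
                 "Batch completion for job %u sent from wrong node "
                 "(%s rather than %s), ignored request",
                 comp_msg->job_id,
                 comp_msg->node_name, job_ptr->batch_host);
        return 1;
    }
    return 0;
}
```

With the second argument changed, a mismatch would show two different
names in the log, so a stray non-printable character would be much
easier to rule in or out.
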
Michael
--
Michael Jennings <[email protected]>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615