[slurm-dev] Re: Bizarre exit code and job re-queue on node crash

Danny Auble Thu, 08 May 2014 15:30:30 -0700


Michael, this was fixed in 2.6.4, commit d7dfa58ef.


Danny

On 05/08/2014 03:25 PM, Michael Gutteridge wrote:

Hi all..

I've run into a curious situation on our production cluster (Slurm
2.6.2, MWM 7.1.2).  It appears that sometimes when a node crashes the
job on that node is being requeued contrary to the configuration
(JobRequeue=0).  It doesn't appear that the requeue flag is being set
on the job, either.  Nor do I think Moab is at fault as it notes that
the job completed, but doesn't see the requeue.

This is curious to me as we are using the wiki2 scheduler, so it
appears that the requeue decision was made internal to Slurm (i.e.
without consulting MWM).

So... logs.  slurmctld.log:


[2014-05-07T23:24:35.143] sched: Allocate JobId=6568293 NodeList=gizmof84 #CPUs=4
...
[2014-05-08T02:53:47.852] error: Nodes gizmof84 not responding
...
[2014-05-08T02:58:01.150] Batch JobId=6568293 missing from node 0
[2014-05-08T02:58:01.150] completing job 6568293
[2014-05-08T02:58:01.150] Job 6568293 cancelled from interactive user
[2014-05-08T02:58:01.150] Requeue JobId=6568293 due to node failure
[2014-05-08T02:58:01.151] sched: job_complete for JobId=6568293 successful, exit code=4294967294
...
[2014-05-08T02:58:01.863] requeue batch job 6568293
...
[2014-05-08T02:58:11.591] _slurm_rpc_submit_batch_job JobId=6577619 usec=2788
[2014-05-08T02:58:12.024] completing job 6568293
[2014-05-08T02:58:12.025] sched: job_complete for JobId=6568293 successful, exit code=0
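
A note on the first exit code in the log above: 4294967294 is 0xFFFFFFFE, which is the unsigned 32-bit representation of -2. That value appears to match Slurm's NO_VAL placeholder from slurm.h, which would suggest the exit code was simply never set rather than corrupted; that reading is an assumption, not something confirmed in this thread. A minimal Python sketch of the arithmetic only:

```python
import struct

# Value reported by slurmctld in the job_complete line above.
exit_code = 4294967294

# Reinterpret the unsigned 32-bit value as a signed 32-bit integer.
signed = struct.unpack("<i", struct.pack("<I", exit_code))[0]

print(hex(exit_code))  # 0xfffffffe
print(signed)          # -2
```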

That exit code looks really horrible.  I'm also certain this job
wasn't cancelled by the user (assuming that is what the "cancelled
from interactive user" message means).  Is it possible that the job
record was somehow corrupted as the node failed?  Anyway, in the
slurmd.log there are indications that munge took some time getting
sorted out (communications errors), but once it does:

[2014-05-08T02:58:01.122] Purging vestigal job script /var/tmp/slurmd/job6568293/slurm_script
[2014-05-08T02:58:11.061] reissued job credential for job 6568293
[2014-05-08T02:58:11.135] Launching batch job 6568293 for UID 45402
[2014-05-08T02:58:11.151] Received cpu frequency information for 4 cpus
[2014-05-08T02:58:11.164] [6568293] checkpoint/blcr init
[2014-05-08T02:58:12.000] [6568293] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2014-05-08T02:58:12.007] [6568293] done with job

I don't have a solid reproducible example yet, but I thought I'd see
if this is a known issue or if anyone has any thoughts on how to get
this to properly fail the jobs.
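
Without a reproducible case, one way to see how often this happens is to scan slurmctld.log for the "Requeue JobId=... due to node failure" message shown above, since with JobRequeue=0 any hit is unexpected. The following is only a rough sketch: the function name find_node_failure_requeues is made up for illustration, and the pattern assumes the message format is exactly as it appears in the excerpt above.

```python
import re

# Pattern assumed from the slurmctld.log excerpt in this message.
REQUEUE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] Requeue JobId=(?P<job>\d+) due to node failure"
)


def find_node_failure_requeues(lines):
    """Return (timestamp, job_id) pairs for node-failure requeue messages."""
    hits = []
    for line in lines:
        match = REQUEUE_RE.match(line)
        if match:
            hits.append((match.group("ts"), int(match.group("job"))))
    return hits


# Sample lines copied from the log excerpt above.
sample = [
    "[2014-05-08T02:58:01.150] Job 6568293 cancelled from interactive user",
    "[2014-05-08T02:58:01.150] Requeue JobId=6568293 due to node failure",
    "[2014-05-08T02:58:01.863] requeue batch job 6568293",
]
print(find_node_failure_requeues(sample))
```

To run it against a real log, pass an open file object instead of the sample list; the log path is site-specific.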

Thanks much

Michael
