Background:
I am running into issues where a job is cancelled and re-queued.  When I look 
into the slurmctld.log, I see the following relevant lines:

[2016-09-30T11:55:24.555] _slurm_rpc_submit_batch_job JobId=79707529 usec=560
[2016-09-30T12:47:15.326] Recovered JobID=79707529 State=0x0 NodeCnt=0 Assoc=0
[2016-09-30T13:12:20.090] backfill: Started JobId=79707529 in prod on compute2
[2016-09-30T13:15:35.239] Batch JobId=79707529 missing from node 0 (not found BatchStartTime after startup), Requeuing job
[2016-09-30T13:15:35.239] job_complete: JobID=79707529 State=0x1 NodeCnt=1 WTERMSIG 126
[2016-09-30T13:15:35.239] job_complete: JobID=79707529 State=0x1 NodeCnt=1 cancelled by node failure
[2016-09-30T13:15:35.239] job_complete: requeue JobID=79707529 State=0x8000 NodeCnt=1 due to node failure
[2016-09-30T13:15:35.239] job_complete: JobID=79707529 State=0x8000 NodeCnt=1 done
[2016-09-30T13:15:37.567] Requeuing JobID=79707529 State=0x0 NodeCnt=0

On the corresponding compute node, I see the following in slurmd.log:

[2016-09-30T13:12:20.679] _run_prolog: prolog with lock for job 79707529 ran for 0 seconds
[2016-09-30T13:22:40.106] Launching batch job 79707529 for UID 2077
[2016-09-30T13:22:41.070] [79707529] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2016-09-30T13:22:41.072] [79707529] done with job

As far as I can tell, the Slurm ‘server’ (slurmctld) started the job and then
checked in on it from time to time.  When the lag between the start and the
launch exceeds BatchStartTime, the server automatically cancels the job.

What I don’t understand is that the job on the ‘client’ (the compute node)
appears to have launched anyway, about seven minutes after it was requeued.
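
To put numbers on the lag, here is a rough sketch that pulls the timestamps out
of the log lines quoted above and computes the gaps.  It assumes the clocks on
the controller and the compute node are in sync, and the helper name parse_ts
is my own, not anything from Slurm.

```python
import re
from datetime import datetime

# Matches the leading "[YYYY-MM-DDTHH:MM:SS.mmm]" stamp on each log line.
TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})\]")

def parse_ts(line):
    """Return the timestamp at the start of a slurmctld/slurmd log line."""
    match = TS_RE.match(line)
    if match is None:
        raise ValueError("no timestamp found in: " + line)
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")

# Lines copied from the excerpts above (wrapped lines joined).
started = parse_ts(
    "[2016-09-30T13:12:20.090] backfill: Started JobId=79707529 in prod on compute2"
)
requeued = parse_ts(
    "[2016-09-30T13:15:35.239] Batch JobId=79707529 missing from node 0 "
    "(not found BatchStartTime after startup), Requeuing job"
)
launched = parse_ts(
    "[2016-09-30T13:22:40.106] Launching batch job 79707529 for UID 2077"
)

start_to_requeue = (requeued - started).total_seconds()
start_to_launch = (launched - started).total_seconds()
requeue_to_launch = (launched - requeued).total_seconds()

print(f"start -> requeue:  {start_to_requeue:.3f} s")   # 195.149 s
print(f"start -> launch:   {start_to_launch:.3f} s")    # 620.016 s
print(f"requeue -> launch: {requeue_to_launch:.3f} s")  # 424.867 s
```

So the controller gave up a little over three minutes after starting the job,
while slurmd did not log the launch until more than ten minutes after the start.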

Questions:
1.  Does Slurm NOT cancel a job it re-queues?
2.  It looks like the ‘client’ computer is slow to go from ‘start’ to ‘launch’.
    a.  It looks like Slurm triggers the ‘start’ of a job, but does it trigger
        the ‘launch’ of a job as well?  Or is ‘launching’ done by the OS?
    b.  What can cause the lag between the ‘start’ and ‘launch’ of a job?
        How can I reduce this lag?

Thank you,
John Lin
