[slurm-dev] Re: what happens after a prolog failure

Alessandro Italiano Tue, 20 Nov 2012 03:27:32 -0800

Hi,

I've done several tests, and it seems that after one or two PrologSlurmctld
failures the batch job [submitted with: sbatch -p debug ale.sh] is canceled.


This is an example of the job status reported by the sacct command:

""""""""""""""""""""""""""""""""""""""""""""""
  [root@pccms60 ~]# sacct -j 77
        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
77               ale.sh      debug       root          1  NODE_FAIL      0:0


[root@pccms60 ~]# scontrol sho  config | grep JobRequeue
JobRequeue              = 1

[root@pccms60 ~]# scontrol sho node| grep State
    State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
    State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
    State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1

[root@pccms60 ~]# scontrol -V
slurm 2.4.4
""""""""""""""""""""


Is it possible to always requeue the job upon a PrologSlurmctld failure?

Thanks in advance

Ale

On 11/19/2012 06:08 PM, Moe Jette wrote:
> I've looked at the code and it is somewhat different from what I
> thought. If the  PrologSlurmctld fails then batch jobs get requeued.
> Interactive jobs (salloc and srun) will be killed.
> diff --git a/doc/man/man5/slurm.conf.5 b/doc/man/man5/slurm.conf.5
> index f45e483..96016bd 100644
> --- a/doc/man/man5/slurm.conf.5
> +++ b/doc/man/man5/slurm.conf.5
> @@ -1229,7 +1229,10 @@ also be used to specify more than one program to run (e.g.
>    the first job step.  The prolog script or scripts may be used to purge files,
>    enable user login, etc.  By default there is no prolog. Any configured script
>    is expected to complete execution quickly (in less time than
> -\fBMessageTimeout\fR).  See \fBProlog and Epilog Scripts\fR for more information.
> +\fBMessageTimeout\fR).
> +If the prolog fails (returns a non\-zero exit code), this will result in the
> +node being set to a DOWN state and the job requeued to executed on another node.
> +See \fBProlog and Epilog Scripts\fR for more information.
>
>    .TP
>    \fBPrologSlurmctld\fR
> @@ -1250,7 +1253,7 @@ If some node can not be made available for use, the program should drain
>    the node (typically using the scontrol command) and terminate with a non\-zero
>    exit code.
>    A non\-zero exit code will result in the job being requeued (where possible)
> -or killed.
> +or killed. Note that only batch jobs can be requeued.
>    See \fBProlog and Epilog Scripts\fR for more information.
>
>    .TP
>
>
>
> Quoting Alessandro Italiano:
>
>> Hi
>>
>> We are going to evaluate slurm as the batch system for our computing
>> farm [14k computing slots].
>>
>> I've done some tests using the prolog script and I've noticed that:
>>
>> 1. When the "Prolog" script fails, the host where it failed is flagged
>>       as DOWN and the job gets stuck in PENDING status.
>> 2. When the "PrologSlurmctld" script fails, the job is CANCELLED.
>>
>>
>> First of all, can someone confirm that this is the expected behavior?
>>
>> Is there a way to configure slurm to automatically dispatch a job to
>> a new host when the "Prolog" script fails?
>>
>> Unfortunately I didn't find any answer to my questions in the "Prolog
>> and Epilog Scripts" section of the slurm.conf man page.
>>
>> thanks in advance
>>
>> Alessandro
>>
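
For reference, the behaviour described in the quoted man page text (drain the
node with scontrol, then terminate with a non-zero exit code) could be sketched
roughly as below. This is only a sketch under assumptions, not a tested
site script: it assumes SLURM_JOB_ID and SLURM_JOB_NODELIST are set in the
PrologSlurmctld environment (check the documentation for your version), and the
names SCONTROL, node_is_usable and prolog_check are made up for illustration.
The ping test is a placeholder for whatever check makes sense locally.

```shell
#!/bin/bash
# Hypothetical PrologSlurmctld sketch: drain unusable nodes, report failure.
# SCONTROL, node_is_usable and prolog_check are illustrative names,
# not part of Slurm.

SCONTROL="${SCONTROL:-scontrol}"

# Placeholder health check (an assumption): replace with a site-specific test.
node_is_usable() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# Check every node allocated to the job and drain those that fail the check.
# Returns 0 if all nodes are usable, 1 otherwise.
prolog_check() {
    local rc=0 node
    for node in $("$SCONTROL" show hostnames "$SLURM_JOB_NODELIST"); do
        if ! node_is_usable "$node"; then
            "$SCONTROL" update NodeName="$node" State=DRAIN \
                Reason="PrologSlurmctld check failed for job $SLURM_JOB_ID"
            rc=1
        fi
    done
    return "$rc"
}
```

A real script would finish with something like "prolog_check; exit $?" so that
a failed check becomes the non-zero exit code the man page text refers to.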
