Hi,
I've run several tests, and it seems that after one or two PrologSlurmctld
failures the batch job (submitted with: sbatch -p debug ale.sh) is canceled.
Here is an example of the job status reported by the sacct command:
""""""""""""""""""""""""""""""""""""""""""""""
[root@pccms60 ~]# sacct -j 77
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
77               ale.sh      debug       root          1  NODE_FAIL      0:0
[root@pccms60 ~]# scontrol sho config | grep JobRequeue
JobRequeue = 1
[root@pccms60 ~]# scontrol sho node| grep State
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
[root@pccms60 ~]# scontrol -V
slurm 2.4.4
""""""""""""""""""""
Is it possible to always requeue the job after a PrologSlurmctld failure?
Thanks in advance,
Ale
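
For reference, the man page text quoted below says that when a node cannot be made available, the PrologSlurmctld program should drain it (typically with scontrol) and terminate with a non-zero exit code. A minimal sketch of such a script follows. The ping-based node_is_usable check, the Reason string, and the SCONTROL override variable are assumptions made for illustration; they are not taken from this thread or from Slurm itself.

```shell
#!/bin/bash
# Sketch of a PrologSlurmctld script (hypothetical; adapt before use).

# Overridable so the logic can be dry-run without a real controller.
SCONTROL="${SCONTROL:-scontrol}"

# Hypothetical health check: succeeds if the node answers one ping.
# Replace with whatever "usable" means at your site.
node_is_usable() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# Drain every unusable node passed as an argument.
# Returns 1 if at least one node was drained, 0 otherwise.
check_nodes() {
    local failed=0 node
    for node in "$@"; do
        if ! node_is_usable "$node"; then
            "$SCONTROL" update NodeName="$node" State=DRAIN \
                Reason="PrologSlurmctld check failed"
            failed=1
        fi
    done
    return "$failed"
}

# Only act when slurmctld has exported the job's node list.
if [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    # Expand a compressed host list (e.g. node[01-03]) to one name per line.
    nodes=$("$SCONTROL" show hostnames "$SLURM_JOB_NODELIST")
    # Word splitting on $nodes is intentional: one argument per host name.
    check_nodes $nodes || exit 1
fi
```

The non-zero exit is what triggers the requeue-or-kill handling described in the reply below; draining the bad node first keeps the requeued job from being dispatched to it again.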
On 11/19/2012 06:08 PM, Moe Jette wrote:
> I've looked at the code and it is somewhat different from what I
> thought. If the PrologSlurmctld fails then batch jobs get requeued.
> Interactive jobs (salloc and srun) will be killed.
> diff --git a/doc/man/man5/slurm.conf.5 b/doc/man/man5/slurm.conf.5
> index f45e483..96016bd 100644
> --- a/doc/man/man5/slurm.conf.5
> +++ b/doc/man/man5/slurm.conf.5
> @@ -1229,7 +1229,10 @@ also be used to specify more than one program to run (e.g.
> the first job step. The prolog script or scripts may be used to purge files,
> enable user login, etc. By default there is no prolog. Any configured script
> is expected to complete execution quickly (in less time than
> -\fBMessageTimeout\fR). See \fBProlog and Epilog Scripts\fR for more information.
> +\fBMessageTimeout\fR).
> +If the prolog fails (returns a non\-zero exit code), this will result in the
> +node being set to a DOWN state and the job requeued to executed on another node.
> +See \fBProlog and Epilog Scripts\fR for more information.
>
> .TP
> \fBPrologSlurmctld\fR
> @@ -1250,7 +1253,7 @@ If some node can not be made available for use, the program should drain
> the node (typically using the scontrol command) and terminate with a non\-zero
> exit code.
> A non\-zero exit code will result in the job being requeued (where possible)
> -or killed.
> +or killed. Note that only batch jobs can be requeued.
> See \fBProlog and Epilog Scripts\fR for more information.
>
> .TP
>
>
>
> Quoting Alessandro Italiano <[email protected]>:
>
>> Hi
>>
>> We are going to evaluate Slurm as the batch system for our computing
>> farm (14k computing slots).
>>
>> I've done some tests using the prolog script and I've noticed that
>>
>> 1. when the "Prolog" script fails, the host where it failed is flagged
>> as DOWN and the job stays stuck in PENDING status.
>> 2. when the "PrologSlurmctld" script fails, the job is CANCELLED.
>>
>>
>> First of all, can someone confirm that this is the expected behavior?
>>
>> Is there a way to configure Slurm to automatically dispatch a job to
>> a new host when the "Prolog" script fails?
>>
>> Unfortunately, I didn't find an answer to my questions in the "Prolog
>> and Epilog Scripts" section of the slurm.conf man page.
>>
>> thanks in advance
>>
>> Alessandro
>>