[slurm-dev] Re: what happens after a prolog failure

Moe Jette Mon, 19 Nov 2012 08:39:05 -0800

Hi Alessandro,

I will update the documentation to explain this. The thinking is that
a Prolog failure indicates a problem with a particular node, so the
job can be requeued to run on another node. A PrologSlurmctld failure,
by contrast, means the job is not going to be able to run on any
node(s). These are the default behaviors; either script can do
something different by executing the appropriate command (e.g.
"scontrol requeue $SLURM_JOBID" or "scancel $SLURM_JOBID").

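As a rough illustration of overriding the default, here is a minimal
sketch of a helper that a Prolog or PrologSlurmctld script could call
when its own checks fail. The function name override_default_action and
the SCONTROL/SCANCEL override variables are hypothetical and exist only
to make the sketch easy to dry-run; the only Slurm commands used are
the two mentioned above.

```shell
#!/bin/sh
# Sketch only: lets a prolog script choose its own failure action
# instead of relying on the default behavior.
# SCONTROL and SCANCEL are hypothetical overrides for dry runs.
SCONTROL="${SCONTROL:-scontrol}"
SCANCEL="${SCANCEL:-scancel}"

# override_default_action ACTION JOBID
#   ACTION is "requeue" or "cancel"; JOBID is normally $SLURM_JOBID.
override_default_action() {
    case "$1" in
        requeue) "$SCONTROL" requeue "$2" ;;
        cancel)  "$SCANCEL" "$2" ;;
        *)       echo "unknown action: $1" >&2; return 2 ;;
    esac
}
```

A PrologSlurmctld script that would rather requeue than have the job
cancelled might call: override_default_action requeue "$SLURM_JOBID"
when its check fails. Which exit status the script should return
afterwards is not covered here and should be verified on a test
cluster.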

Moe Jette
SchedMD

Quoting Alessandro Italiano <[email protected]>:

>
> Hi
>
> we are going to evaluate slurm as the batch system for our computing
> farm [14k computing slots].
>
> I've done some tests using the prolog script and I've noticed that
>
> 1. when the "Prolog" script fails, the host where it failed is flagged
>    as DOWN and the job remains stuck in PENDING status.
> 2. when the "PrologSlurmctld" script fails the job is CANCELLED.
>
>
> first of all, can someone confirm that this is the expected behavior ?
>
> Is there a way to configure slurm to automatically dispatch a job
> to a new host when the "Prolog" script fails?
>
> unfortunately I didn't find any answer to my questions in the "Prolog
> and Epilog Scripts" section of the slurm.conf man page
>
> thanks in advance
>
> Alessandro
>
