Hi Jaysheel,

There's similar functionality.  Take a look at the slurm.conf man page, at
the RequeueExit and RequeueExitHold parameters.
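
As a rough illustration of how this could map onto the pipeline in the question, here is a minimal sketch. It assumes the cluster admin has set RequeueExitHold=100 in slurm.conf (the value 100 is only borrowed from the Grid Engine example), and the script names and the submit_chain helper are hypothetical. Each job is submitted with an afterok dependency on the previous one, so a parent that is requeued and held should keep its dependents pending rather than letting them start.

```shell
#!/bin/bash
# Sketch only. Assumed slurm.conf setting (made by the cluster admin):
#   RequeueExitHold=100
# With that in place, a batch job exiting with code 100 is requeued and held.

# SBATCH can be overridden (e.g. for a dry run); defaults to the real sbatch.
SBATCH="${SBATCH:-sbatch}"

# submit_chain SCRIPT... : submit each script so that it waits for the
# previous one to complete successfully (afterok dependency).
submit_chain() {
    local prev="" jid script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            jid=$("$SBATCH" --parsable "$script") || return 1
        else
            jid=$("$SBATCH" --parsable --dependency=afterok:"$prev" "$script") || return 1
        fi
        # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
        prev="${jid%%;*}"
        echo "$script -> job $prev"
    done
}
```

Calling submit_chain job1.sh job2.sh job3.sh would submit the three jobs in order. Once the missing file is fixed, the held job can be released with scontrol release <jobid>, which plays roughly the role of qmod in Grid Engine.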

Best of luck w/your transition.

Lyn

On Tue, Apr 7, 2015 at 8:15 AM, Jaysheel Bhavsar <[email protected]>
wrote:

>
> Hello,
>         I am new to Slurm and am wondering if there is a way to put a job
> in an error wait state, similar to what Grid Engine does.  The intention
> is that a pipeline will submit multiple jobs with dependencies.  If a
> parent job has an error (a missing input file, for example), I would like
> the job to stay in the queue in an error state.  This way dependent jobs don’t
> start executing, and the error does not propagate downstream.  There are quite
> a few advantages to this in a complex pipeline.  Is there a similar
> mechanism in Slurm?
>
>         Here is pseudo-code for what currently happens in OGE:
>
>                 qsub job1
>                 qsub -hold_jid job1 job2
>                 qsub -hold_jid job2 job3
>
>
>         In job1
>
>                 If (not file_exists(filename)) {
>                         send email to user of missing file.
>                         exit 100;
>                 } else {
>                         proceed…
>                 }
>
>                 exit 0;
>
>
>         Using this system, job1 is put in the Eqw state, and the user is
> alerted to the error and can identify and fix the cause of the missing file.
> Once it is fixed, the user can clear the error state using qmod; job1 then
> runs normally and the pipeline continues.
>
>         Of course I could do a busy wait till the expected file is
> available but that is not ideal, as resources are tied up which could be
> used by other users.
>
> Thank you
> Jaysheel