Hi Lyn,
Thanks for the quick response. RequeueExit and RequeueExitHold will work.
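
For the archives, here is a minimal sketch of how the pipeline might look on the Slurm side. It is untested against a real cluster and rests on a few assumptions: that slurm.conf contains RequeueExitHold=100, that the jobs are requeueable, and that the function names (require_input, submit_chain) and script names are placeholders of my own, not anything Slurm provides.

```shell
#!/bin/bash
# Sketch only. Assumes slurm.conf contains:  RequeueExitHold=100
# and that jobs may be requeued (JobRequeue=1 or sbatch --requeue).

# For use inside job1: return 100 if the input file is missing, so the
# job script can exit with 100 and be requeued in a held state.
require_input() {
    if [ ! -f "$1" ]; then
        echo "missing input file: $1" >&2
        return 100
    fi
}

# For use on the submit host: chain three job scripts so that each one
# starts only after the previous one has finished with exit code 0.
# Assumes a single-cluster setup, where --parsable prints just the job ID.
submit_chain() {
    local id1 id2
    id1=$(sbatch --parsable --requeue "$1") || return 1
    id2=$(sbatch --parsable --dependency=afterok:"$id1" "$2") || return 1
    sbatch --dependency=afterok:"$id2" "$3"
}
```

In job1 the check would be called as: require_input "$filename" || exit 100. The chain would be submitted with: submit_chain job1.sh job2.sh job3.sh. As I understand it, a job1 that exits with 100 goes back to the queue in a held state, so job2 and job3 stay pending, and once the input is fixed the job can be released with scontrol release followed by the job ID.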
Thanks
Jaysheel
> On Apr 7, 2015, at 2:30 PM, Lyn Gerner <[email protected]> wrote:
>
> Hi Jaysheel,
>
> There's similar functionality. Take a look at the slurm.conf man page, at
> the RequeueExit and RequeueExitHold parameters.
>
> Best of luck w/your transition.
>
> Lyn
>
> On Tue, Apr 7, 2015 at 8:15 AM, Jaysheel Bhavsar <[email protected]> wrote:
>
> Hello,
> I am new to Slurm and am wondering if there is a way to put a job in an
> error wait state, similar to what Grid Engine does. The intention is
> that a pipeline will submit multiple jobs with dependencies. If a parent job
> hits an error (a missing input file, for example), I would like the job to
> stay in the queue in an error state. That way dependent jobs don’t start
> executing, and the error does not propagate downstream. There are quite a few
> advantages to this in a complex pipeline. Is there a similar mechanism in
> Slurm?
>
> Here is pseudo-code for what currently happens in OGE:
>
> qsub job1
> qsub -hold_jid job1 job2
> qsub -hold_jid job2 job3
>
>
> In job1:
>
> # $filename stands for the required input file
> if [ ! -f "$filename" ]; then
>     # send email to the user about the missing file
>     exit 100   # exit code 100 puts the job into the Eqw state
> fi
>
> # proceed…
>
> exit 0
>
>
> With this setup, job1 is put into the Eqw state, and the user is alerted to
> the error and can identify and fix the cause of the missing file. Once it is
> fixed, the user can clear the error state using qmod; job1 then runs
> normally and the pipeline continues.
>
> Of course, I could busy-wait until the expected file is available,
> but that is not ideal, as it ties up resources that other users
> could use.
>
> Thank you
> Jaysheel
>
>