Hi Lyn
        Thanks for a quick response.  RequeueExit and RequeueExitHold will work.
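
For anyone finding this thread later, here is a minimal sketch of how the two parameters might map onto the OGE pattern quoted below. It assumes slurm.conf contains RequeueExitHold=100 (the value 100 is simply carried over from the pseudo-code), that jobs are allowed to requeue, and a single-cluster setup where sbatch --parsable prints only the job ID. The script names job1.sh, job2.sh, job3.sh and the helper require_input are hypothetical. Treat it as an outline rather than a verified recipe, and check the slurm.conf, sbatch and scontrol man pages for your version.

```shell
#!/bin/bash
# Sketch only. Assumes slurm.conf on the controller contains, for example:
#     RequeueExitHold=100
# so that a batch job exiting with status 100 is requeued and held instead
# of finishing as failed. Any agreed exit code could be listed instead.

HOLD_EXIT=100   # assumption: must match the value listed in RequeueExitHold

# require_input FILE (hypothetical helper): return 0 if FILE exists,
# otherwise print a message and return HOLD_EXIT so the caller can exit
# with the code that triggers the hold.
require_input() {
    if [ ! -e "$1" ]; then
        echo "missing input file: $1" >&2
        return "$HOLD_EXIT"
    fi
    return 0
}

# In job1.sh, before the real work starts:
#     require_input "$INPUT" || exit $?
#
# Submitting the chain (--requeue is included in case requeueing is
# disabled by default at the site):
#     j1=$(sbatch --parsable --requeue job1.sh)
#     j2=$(sbatch --parsable --dependency=afterok:"$j1" job2.sh)
#     j3=$(sbatch --parsable --dependency=afterok:"$j2" job3.sh)
#
# After fixing the missing file, release the held job:
#     scontrol release "$j1"
```

With this in place, a job1 that exits with 100 should be requeued in a held state, so job2 and job3 stay pending on their afterok dependencies until the hold is released.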

Thanks
Jaysheel

> On Apr 7, 2015, at 2:30 PM, Lyn Gerner <[email protected]> wrote:
> 
> Hi Jaysheel,
> 
> There's similar functionality.  Take a look at the slurm.conf man page, at 
> the RequeueExit and RequeueExitHold parameters.  
> 
> Best of luck w/your transition.
> 
> Lyn
> 
> On Tue, Apr 7, 2015 at 8:15 AM, Jaysheel Bhavsar <[email protected]> wrote:
> 
> Hello,
>         I am new to Slurm and am wondering whether there is a way to put a job in 
> an error wait state, similar to what Grid Engine does.  The intention is 
> that a pipeline will submit multiple jobs with dependencies.  If a parent job 
> has an error (a missing input file, for example), I would like the job to 
> stay in the queue in an error state.  This way dependent jobs don’t start 
> execution, and the error does not propagate downstream.  There are quite a few 
> advantages to this in a complex pipeline.  Is there a similar mechanism in 
> Slurm?
> 
>         Here is pseudo-code for what currently happens in OGE:
> 
>                 qsub job1
>                 qsub job2 -hold_jid job1
>                 qsub job3 -hold_jid job2
> 
> 
>         In job1
> 
>                 If (not file_exists(filename)) {
>                         send email to user of missing file.
>                         exit 100;
>                 } else {
>                         proceed…
>                 }
> 
>                 exit 0;
> 
> 
>         Under this system, job1 is put in the Eqw state, and the user is alerted to the 
> error and can identify and fix the cause of the missing file.  Once it is fixed, the user can 
> clear the error state using qmod; job1 then proceeds with normal execution and the 
> pipeline continues normally.
> 
>         Of course, I could busy-wait until the expected file is available, 
> but that is not ideal, since it ties up resources that other users could 
> be using.
> 
> Thank you
> Jaysheel
> 
> 
