Hi Jaysheel,

There's similar functionality. Take a look at the slurm.conf man page, at the RequeueExit and RequeueExitHold parameters.
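
A rough sketch of how your OGE example might translate, assuming the admin has set something like RequeueExitHold=100 in slurm.conf (the value 100 only mirrors your exit code; job1.sh, job2.sh, job3.sh and the two function names are placeholders I made up, not anything Slurm provides):

```shell
#!/bin/bash
# Assumption: slurm.conf on the cluster contains a line such as
#   RequeueExitHold=100
# so that a batch job exiting with code 100 is requeued and held.

# Submit a three-stage chain. Each later stage waits for the previous
# one to finish with exit code 0 (afterok). --parsable makes sbatch
# print just the job ID (plus ";cluster" on multi-cluster setups).
submit_chain() {
    local j1 j2 j3
    j1=$(sbatch --parsable job1.sh) || return 1
    j2=$(sbatch --parsable --dependency=afterok:"$j1" job2.sh) || return 1
    j3=$(sbatch --parsable --dependency=afterok:"$j2" job3.sh) || return 1
    echo "$j1 $j2 $j3"
}

# Hypothetical helper for use inside job1.sh: returns 100 when the
# expected input file is missing, 0 otherwise.
check_input() {
    if [ ! -f "$1" ]; then
        echo "missing input file: $1" >&2
        return 100
    fi
}
```

Inside job1.sh you would call something like: check_input "$filename" || exit $?. With the config above, job1 should then be requeued in a held state rather than finishing, so job2 and job3 should stay pending on their dependencies. Once the file is in place, "scontrol release <jobid>" lets job1 run again. I haven't tested this on your setup, so treat it as a starting point.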

Best of luck w/your transition.

Lyn

On Tue, Apr 7, 2015 at 8:15 AM, Jaysheel Bhavsar <[email protected]> wrote:

> Hello,
>     I am new to slurm and am wondering if there is a way to put a job
> in error wait state similar to what grid engine does. The intention
> is that a pipeline will submit multiple jobs with dependencies. If a
> parent job has an error (missing input file for example) I would like for
> the job to stay in the queue in error state. This way dependent jobs don’t
> start execution, which stops the error from propagating downstream. There
> are quite a few advantages to this in a complex pipeline. Is there a similar
> mechanism in slurm?
>
> Here is pseudo-code of what currently happens in OGE:
>
> qsub job1
> qsub job2 -hold_jid job1
> qsub job3 -hold_jid job2
>
>
> In job1
>
> If (not file_exists(filename)) {
>     send email to user of missing file.
>     exit 100;
> } else {
>     proceed…
> }
>
> exit 0;
>
>
> Using this system job1 is set in Eqw state and the user is alerted of
> the error and can identify and fix the cause of the missing file. Once it
> is fixed, the user can clear the error state using qmod, job1 will proceed
> with normal execution, and the pipeline continues normally.
>
> Of course I could do a busy wait till the expected file is
> available but that is not ideal, as resources are tied up which could be
> used by other users.
>
> Thank you
> Jaysheel
