Hello,
        I am new to Slurm and am wondering if there is a way to put a job in
an error wait state, similar to what Grid Engine does.  The intention is that
a pipeline will submit multiple jobs with dependencies.  If a parent job hits an
error (a missing input file, for example), I would like the job to stay in the
queue in an error state.  This way dependent jobs don’t start executing, which
stops the error from propagating downstream.  This has quite a few advantages in
a complex pipeline.  Is there a similar mechanism in Slurm?

        Here is pseudo-code for what currently happens in OGE:

                qsub job1
                qsub job2 -hold_jid job1
                qsub job3 -hold_jid job2


        In job1:

                if [ ! -f "$filename" ]; then
                        # send email to user about the missing file
                        exit 100   # exit status 100 puts the job in Eqw
                fi

                # file exists: proceed…

                exit 0


        With this system, job1 is put in the Eqw state, and the user is alerted
to the error and can identify and fix the cause of the missing file.  Once it is
fixed, the user can clear the error state using qmod; job1 then executes
normally and the rest of the pipeline continues.

        Of course, I could busy-wait until the expected file is available,
but that is not ideal, as it ties up resources that other users could be
using.
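
        For reference, here is a rough sketch of how I imagine the same
pipeline might look in Slurm, based only on my reading of the sbatch and
scontrol man pages.  The script names (job1.sh, job2.sh, job3.sh) and the two
helper functions are placeholders I made up, and I am assuming that
"scontrol requeuehold State=SpecialExit" is the closest thing to Eqw and that
a job is allowed to requeue itself this way.  I have not confirmed any of
this on a real cluster.

```shell
#!/bin/bash
# Sketch only: script names and helper names are placeholders, not a tested setup.

# Submit three jobs as a chain; each child starts only if its parent exits 0.
# --parsable makes sbatch print just the job ID (on multi-cluster setups the
# output is "jobid;cluster", which this sketch does not handle).
submit_chain() {
    local jid1 jid2 jid3
    jid1=$(sbatch --parsable --requeue job1.sh) || return 1
    jid2=$(sbatch --parsable --dependency=afterok:"$jid1" job2.sh) || return 1
    jid3=$(sbatch --parsable --dependency=afterok:"$jid2" job3.sh) || return 1
    echo "$jid1 $jid2 $jid3"
}

# Meant to be called at the top of job1.sh. If the input file is missing,
# ask Slurm to requeue this job in a held SPECIAL_EXIT state rather than
# letting it fail, then report status 100 to the caller.
hold_if_missing() {
    local input=$1
    if [ ! -f "$input" ]; then
        echo "missing input: $input" >&2
        scontrol requeuehold State=SpecialExit "$SLURM_JOB_ID"
        return 100
    fi
    return 0
}
```

        If this works the way I hope, job2 and job3 would stay pending while
job1 is held, and after fixing the file I would run "scontrol release" on
job1's ID, much like clearing Eqw with qmod.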

Thank you
Jaysheel