Hello,
I am new to Slurm and am wondering whether there is a way to put a job in
an error-wait state, similar to what Grid Engine does. The intention is that
a pipeline submits multiple jobs with dependencies. If a parent job hits an
error (a missing input file, for example), I would like the job to stay in the
queue in an error state. That way dependent jobs don't start executing, and the
error does not propagate downstream. This has quite a few advantages in
a complex pipeline. Is there a similar mechanism in Slurm?
Here is pseudocode for what currently happens in OGE:
qsub job1
qsub -hold_jid job1 job2
qsub -hold_jid job2 job3
In job1:
if (not file_exists(filename)) {
    send email to user about the missing file;
    exit 100;
} else {
    proceed…
}
exit 0;
With this setup, job1 is put in the Eqw state, and the user is alerted to the
error and can identify and fix the cause of the missing file. Once it is fixed, the
user can clear the error state using qmod, and job1 will proceed with normal
execution and the pipeline continues normally.
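As a minimal sketch, the check inside job1 might look like the following shell snippet. The function name check_input and the INPUT_FILE variable are made up for illustration, and the mail step is left as a comment because the actual notification command depends on the site. It only shows the Grid Engine side described above, where an exit status of 100 leaves the job in the Eqw state.

```shell
#!/bin/sh
# Hypothetical helper for job1 (names are illustrative, not from a real pipeline).
# Returns 0 if the input file exists, 100 otherwise.
check_input() {
    if [ ! -f "$1" ]; then
        # Placeholder: notify the user here (site-specific mail command).
        echo "missing input file: $1" >&2
        return 100
    fi
    return 0
}

# Intended use inside the job script (INPUT_FILE is an assumed variable):
#   check_input "$INPUT_FILE" || exit 100   # exit 100 puts the job in Eqw
#   ... proceed with the real work ...
```

Once the missing file has been restored, the error state would be cleared with qmod (to my knowledge `qmod -cj <job_id>`, but check the qmod man page for your Grid Engine version).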
Of course, I could busy-wait until the expected file is available,
but that is not ideal, as it ties up resources that could be used by other
users.
Thank you
Jaysheel