We are looking at moving from SGE (old) to SLURM for our production
clusters. We make heavy use of task arrays and the special job error state
of 100.
SLURM now has task arrays, which is great, but as far as I can see it doesn't
support the job error state of 100 (or an equivalent). Is this
planned/available? Can we pay to have it added?
Let me explain.
SGE has a special job error state of 100 (i.e. exit 100) which puts the job
into E state in the queue. The job leaves its allocated node(s) and goes
back into the queue in E state. This means we can easily see which jobs
have failed, look at their logs, fix the problem (usually a system problem,
like an unmounted file system or a crashed ypbind) and then clear the error,
at which point the job goes into Q state. It then gets rescheduled back onto
the cluster.
We use this in our batch scripts like:
#!/bin/bash
set -o pipefail
command1 | command2 | command3 || exit 100
If any of the commands fail, the job ends up in E state in the queue.
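(For what it's worth, one way we could imagine approximating this on SLURM
is a small wrapper that traps exit code 100 and puts the job on requeue-hold
for an operator to inspect and later release. This is only a sketch under
our assumptions -- the scontrol requeuehold call and reliance on
SLURM_JOB_ID are untested guesses on our part, not a confirmed SLURM
feature equivalent to SGE's E state.)

```shell
# run_and_hold: run the given command (e.g. the pipeline above, with
# pipefail in effect); if it exits 100, requeue the current SLURM job
# in a held state so an operator can inspect and later release it.
# Sketch only -- assumes scontrol is on PATH and slurmd exports
# SLURM_JOB_ID; neither is verified here.
run_and_hold() {
    "$@"
    local rc=$?
    if [ "$rc" -eq 100 ]; then
        # Roughly analogous to SGE's E state: job leaves the node and
        # waits, held, until an operator runs `scontrol release`.
        scontrol requeuehold "$SLURM_JOB_ID"
    fi
    return "$rc"
}
```

It would be invoked from the batch script as, say,
run_and_hold bash -c 'command1 | command2 | command3'
with set -o pipefail in effect inside that subshell.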
Thanks
--
Dr Stuart Midgley
[email protected]