One note: Only batch jobs will be requeued. We can't do much for jobs
initiated by salloc or srun.
That would be fine, most of our jobs are sbatch submissions.
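
To make the batch-only rule above concrete, here is a minimal sketch in Python. It is only an illustration of the behaviour described in this thread, not slurmctld's actual code: the Job class, its fields, and should_requeue_on_node_failure are hypothetical names made up for this example, and the per-job requeue flag is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical stand-in for a job record; not a Slurm data structure."""
    job_id: int
    is_batch: bool                # True for sbatch submissions, False for salloc/srun
    requeue_allowed: bool = True  # assumed per-job switch, for illustration only


def should_requeue_on_node_failure(job: Job) -> bool:
    """Requeue only batch jobs that have not opted out of requeueing."""
    return job.is_batch and job.requeue_allowed


if __name__ == "__main__":
    jobs = [
        Job(101, is_batch=True),                          # sbatch job
        Job(102, is_batch=False),                         # salloc/srun job
        Job(103, is_batch=True, requeue_allowed=False),   # batch job that opted out
    ]
    for job in jobs:
        action = "requeue" if should_requeue_on_node_failure(job) else "do not requeue"
        print(job.job_id, action)
```

Under these assumptions only the first job would go back into the queue; the interactive job has no batch script for the controller to rerun.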
This is from June 14:
Hi,
We have a user claiming his job was not requeued when the node failed.
Slurmctld detects the missing job when the node is rebooted and slurmd sends
the registration message.
In these cases, slurmctld just calls job_complete with requeue=0 and
node_fail=1. I wonder
Quoting Aaron Knister aaron.knis...@gmail.com:
Hi Mario,
SLURM can and will, I believe by default, resubmit jobs that fail
due to node failures recognized by slurmctld that put the node