[slurm-dev] Re: Resubmit on failure

2013-06-20 Thread Mario Kadastik
One note: Only batch jobs will be requeued. We can't do much for jobs initiated by salloc or srun. That would be fine; most of our jobs are sbatch submissions. Quoting Aaron Knister aaron.knis...@gmail.com: SLURM can and will, I believe by default, resubmit jobs that fail due to node

[slurm-dev] Re: Resubmit on failure

2013-06-20 Thread Alejandro Lucero Palau
This is from June 14: Hi, We have a user claiming his job was not requeued when the node failed. Slurmctld detects the missing job when the node is rebooted and slurmd sends the registration message. In these cases, slurmctld just calls job_complete with requeue=0 and node_fail=1. I wonder

[slurm-dev] Re: Resubmit on failure

2013-06-19 Thread Moe Jette
One note: Only batch jobs will be requeued. We can't do much for jobs initiated by salloc or srun. Quoting Aaron Knister aaron.knis...@gmail.com: Hi Mario, SLURM can and will, I believe by default, resubmit jobs that fail due to node failures recognized by slurmctld that put the node