One note: Only batch jobs will be requeued. We can't do much for jobs initiated by salloc or srun.
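
To make the batch-only point concrete, here is a minimal Python sketch of how a submission tool might opt its jobs in to requeueing. It assumes the standard `sbatch --requeue` / `--no-requeue` options and the `scontrol requeue <job_id>` subcommand. The helper names, `job.sh`, and the job ID are hypothetical and only for illustration. Whether jobs are requeued by default depends on the site's `JobRequeue` setting in slurm.conf, so treat this as a starting point rather than a recipe.

```python
import shlex


def build_sbatch_command(script_path, allow_requeue=True):
    """Return the argv list for submitting a batch script.

    Hypothetical helper: it only assembles the command, it does not run it.
    --requeue marks the batch job as eligible for requeue;
    --no-requeue opts the job out.
    """
    flag = "--requeue" if allow_requeue else "--no-requeue"
    return ["sbatch", flag, script_path]


def build_manual_requeue_command(job_id):
    """Return the argv list that asks slurmctld to requeue one batch job."""
    return ["scontrol", "requeue", str(job_id)]


if __name__ == "__main__":
    # "job.sh" and 12345 are placeholders, not values from this thread.
    print(shlex.join(build_sbatch_command("job.sh")))
    print(shlex.join(build_manual_requeue_command(12345)))
```

The commands are built as lists so that a wrapper can pass them to `subprocess.run` without shell quoting problems. None of this applies to work started with salloc or srun.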
Quoting Aaron Knister <[email protected]>:

>
> Hi Mario,
>
> SLURM can and will, I believe by default, resubmit jobs that fail
> due to node failures recognized by slurmctld that put the node in an
> offline state. This doesnt help you, however, as SLURM doesnt appear
> to notice these failures.
>
> I wonder if a SPANK plugin could do the job here.
>
> Sent from my iPad
>
> On Jun 19, 2013, at 12:36 PM, Mario Kadastik <[email protected]> wrote:
>
>>
>> Hi,
>>
>> I've tried to look for this, but is there any way to have automatic
>> job resubmission in case it fails. We occasionally have hiccups for
>> random nodes where a job might fail due to temporary network loss
>> or loss of storage mount or what not and when users send thousands
>> of jobs and say 0.1% fail they have to track down the individual
>> jobs and resubmit those even though they might have had a tool that
>> send those 5000 jobs in sequence. It would really be nice if they
>> could just claim that they accept say 1 automatic resubmission with
>> same initial conditions as the job got submitted. The user would
>> know if the filesystems etc is fine with that and in our case
>> mostly is.
>>
>> Is such a feature already in slurm or not? If yes, can you point me
>> to documentation.
>>
>> Thanks,
>>
>> Mario Kadastik, PhD
>> Researcher
>>
>> ---
>> "Physics is like sex, sure it may have practical reasons, but
>> that's not why we do it"
>> -- Richard P. Feynman
