Hi Mario,

SLURM can, and I believe by default will, requeue jobs that fail because of a node failure that slurmctld recognizes and that puts the node in an offline state. This doesn't help you, however, as SLURM doesn't appear to notice these failures.
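Until something built in covers this, a user-side workaround might look roughly like the sketch below. It is untested against your setup and rests on several assumptions: the job is a batch job that is allowed to be requeued (submitted with `sbatch --requeue`, or `JobRequeue` enabled in slurm.conf), users may run `scontrol requeue` on their own jobs, and your SLURM version sets `SLURM_RESTART_COUNT` on requeued jobs. The wrapper itself, `should_requeue`, and `MAX_REQUEUES` are made-up names for illustration, not part of SLURM.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper: run a payload inside a SLURM batch job and
ask SLURM to requeue the job once if the payload fails."""
import os
import subprocess
import sys

MAX_REQUEUES = 1  # assumption: the user accepts one automatic resubmission


def should_requeue(exit_code, restart_count, max_requeues=MAX_REQUEUES):
    """Requeue only if the payload failed and the retry budget is not used up."""
    return exit_code != 0 and restart_count < max_requeues


def main(argv):
    if not argv:
        print("usage: requeue_wrapper.py COMMAND [ARGS...]", file=sys.stderr)
        return 2

    # Run the real payload and keep its exit status.
    exit_code = subprocess.call(argv)

    # Assumed to be set by SLURM on requeued jobs and absent on the first run.
    restart_count = int(os.environ.get("SLURM_RESTART_COUNT", "0"))
    job_id = os.environ.get("SLURM_JOB_ID")

    if job_id and should_requeue(exit_code, restart_count):
        # Put this batch job back in the queue with its original parameters.
        subprocess.call(["scontrol", "requeue", job_id])

    return exit_code
```

In the batch script the payload would then be started through the wrapper, with `sys.exit(main(sys.argv[1:]))` added at the bottom of the file. Note that this retries on any non-zero exit status, so it cannot tell a lost mount from a genuine bug in the user's job.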
I wonder if a SPANK plugin could do the job here.

On Jun 19, 2013, at 12:36 PM, Mario Kadastik <[email protected]> wrote:

>
> Hi,
>
> I've tried to look for this, but is there any way to have automatic job
> resubmission in case it fails? We occasionally have hiccups for random nodes
> where a job might fail due to temporary network loss or loss of storage mount
> or what not, and when users send thousands of jobs and say 0.1% fail they have
> to track down the individual jobs and resubmit those even though they might
> have had a tool that sends those 5000 jobs in sequence. It would really be
> nice if they could just claim that they accept say 1 automatic resubmission
> with the same initial conditions as when the job got submitted. The user would
> know if the filesystems etc. are fine with that, and in our case they mostly are.
>
> Is such a feature already in slurm or not? If yes, can you point me to
> documentation.
>
> Thanks,
>
> Mario Kadastik, PhD
> Researcher
>
> ---
> "Physics is like sex, sure it may have practical reasons, but that's not why
> we do it"
> -- Richard P. Feynman
