Hi,
I've tried to look for this, but is there any way to have a job resubmitted
automatically in case it fails? We occasionally have hiccups on random nodes
where a job fails due to a temporary network loss, a lost storage mount, or
the like. When users submit thousands of jobs and, say, 0.1% fail, they have
to track down the individual jobs and resubmit them, even though they may
have used a tool that submitted those 5000 jobs in sequence. It would be
really nice if they could simply declare that they accept, say, one automatic
resubmission with the same initial conditions the job was originally
submitted with. The user would know whether the filesystems etc. are fine
with that, and in our case they mostly are.
Is such a feature already in Slurm? If so, could you point me to the
documentation?
Thanks,
Mario Kadastik, PhD
Researcher
---
"Physics is like sex, sure it may have practical reasons, but that's not why
we do it"
-- Richard P. Feynman