Hi,

I've tried to look for this without success: is there any way to have a job 
resubmitted automatically if it fails? We occasionally have hiccups on random 
nodes where a job fails due to a temporary network loss, a lost storage mount, 
or the like. When users submit thousands of jobs and, say, 0.1% fail, they 
have to track down the individual jobs and resubmit them, even though they 
may have used a tool that submitted those 5000 jobs in sequence. It would be 
really nice if they could simply declare that they accept, say, 1 automatic 
resubmission with the same initial conditions the job was originally submitted 
with. The user would know whether the filesystems etc. are fine with that, and 
in our case they mostly are. 

Is such a feature already in Slurm? If so, could you point me to the 
documentation?

Thanks,

Mario Kadastik, PhD
Researcher

---
  "Physics is like sex, sure it may have practical reasons, but that's not why 
we do it" 
     -- Richard P. Feynman
