One note: Only batch jobs will be requeued. We can't do much for jobs  
initiated by salloc or srun.
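
For failures that slurmctld does not notice, one user-side workaround
is to wrap the job's command so that it is retried once inside the same
allocation. The following is only a minimal sketch under that
assumption. It is not a Slurm feature, and run_with_retry and its
max_retries parameter are hypothetical names chosen for illustration.
A retry on the same node will only help if the problem (for example a
short network loss) has cleared by the time the command is rerun.

```python
import subprocess
import sys


def run_with_retry(cmd, max_retries=1):
    """Run cmd; if it exits non-zero, rerun it up to max_retries more times.

    cmd is an argument list such as ["./my_job", "input.dat"].
    Returns a tuple (exit_code, attempts_used).
    """
    attempts = 0
    while True:
        attempts += 1
        # Each attempt starts the command from scratch with the same arguments.
        result = subprocess.run(cmd)
        if result.returncode == 0 or attempts > max_retries:
            return result.returncode, attempts


# Hypothetical usage from inside a batch script's payload:
#   code, attempts = run_with_retry(["./my_job", "input.dat"], max_retries=1)
#   sys.exit(code)
```

This only makes sense if the job can safely be rerun with the same
initial conditions, which is for the user to judge.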


Quoting Aaron Knister <[email protected]>:

>
> Hi Mario,
>
> SLURM can and will, I believe by default, resubmit jobs that fail
> due to node failures that slurmctld recognizes and that put the node
> in an offline state. This doesn't help you, however, as SLURM doesn't
> appear to notice these failures.
>
> I wonder if a SPANK plugin could do the job here.
>
> Sent from my iPad
>
> On Jun 19, 2013, at 12:36 PM, Mario Kadastik <[email protected]> wrote:
>
>>
>> Hi,
>>
>> I've tried to look for this: is there any way to have a job
>> resubmitted automatically if it fails? We occasionally have hiccups
>> on random nodes, where a job might fail due to temporary network
>> loss, loss of a storage mount, or the like. When users send
>> thousands of jobs and, say, 0.1% fail, they have to track down the
>> individual jobs and resubmit them, even though they may have used a
>> tool that sent those 5000 jobs in sequence. It would be really nice
>> if they could just declare that they accept, say, 1 automatic
>> resubmission with the same initial conditions the job was originally
>> submitted with. The user would know whether the filesystems etc. are
>> fine with that, and in our case they mostly are.
>>
>> Is such a feature already in SLURM or not? If yes, can you point me
>> to the documentation?
>>
>> Thanks,
>>
>> Mario Kadastik, PhD
>> Researcher
>>
>> ---
>>  "Physics is like sex, sure it may have practical reasons, but  
>> that's not why we do it"
>>     -- Richard P. Feynman
