> One note: Only batch jobs will be requeued. We can't do much for jobs  
> initiated by salloc or srun.

That would be fine; most of our jobs are sbatch submissions.

> Quoting Aaron Knister <[email protected]>:
>> SLURM can and will, I believe by default, resubmit jobs that fail  
>> due to node failures recognized by slurmctld that put the node in an  
>> offline state. This doesn't help you, however, as SLURM doesn't appear  
>> to notice these failures.
>> 
>> I wonder if a SPANK plugin could do the job here.

Yes, resubmitting on node failure is fine, but sometimes the job discovers the 
failure before the health check script does: the job is actively using the 
service that fails, while the health check runs only every ~5 minutes. So yes, 
it would be nice to have a flag that can be set at submission time (it should 
be up to the user to choose whether (s)he wants a resubmit or not).
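
A rough sketch of the job-side half of this idea, in Python, with several assumptions: the exit code 75 and the REQUEUE_ON_SERVICE_FAILURE variable are made-up conventions for illustration (not Slurm features), the job is a batch job that the site allows to be requeued, and the user is permitted to run `scontrol requeue` on their own job. The wrapper runs the payload and, if the payload reports that the service went away and the user opted in at submission, asks Slurm to requeue the job.

```python
import os
import subprocess

# Hypothetical convention: the payload exits with this code when the
# service it depends on has failed (75 is EX_TEMPFAIL in sysexits.h).
SERVICE_FAILURE_EXIT = 75


def run_with_requeue(payload, run=subprocess.call, env=os.environ):
    """Run the payload; on a service failure, ask Slurm to requeue this job.

    `run` and `env` are parameters only so the logic can be exercised
    without a Slurm installation.
    """
    rc = run(payload)
    job_id = env.get("SLURM_JOB_ID")  # set by Slurm inside a batch job
    # Hypothetical opt-in switch chosen by the user at submission time.
    opted_in = env.get("REQUEUE_ON_SERVICE_FAILURE") == "1"
    if rc == SERVICE_FAILURE_EXIT and opted_in and job_id:
        # Assumption: the job is requeueable and the user may requeue it.
        run(["scontrol", "requeue", job_id])
    return rc
```

The opt-in could be passed at submission with something like `sbatch --requeue --export=ALL,REQUEUE_ON_SERVICE_FAILURE=1 job.sh`, with the batch script's Python entry point calling `run_with_requeue(["./payload"])`. This does not mark the node as bad, so the requeued job may land on the same node again until the health check catches up.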

Thanks,

Mario Kadastik, PhD
Researcher

---
  "Physics is like sex, sure it may have practical reasons, but that's not why 
we do it" 
     -- Richard P. Feynman
