Hi Alessandro,

I will update the documentation to explain this. The reasoning is that a Prolog failure indicates a problem with one particular node, so the job can be requeued to run on another node. If the PrologSlurmctld fails, the job would not be able to run on any node. These are the default behaviors, and either script can do something different by executing the appropriate command (e.g. "scontrol requeue $SLURM_JOBID" or "scancel $SLURM_JOBID").

Moe Jette
SchedMD

Quoting Alessandro Italiano <[email protected]>:

> Hi
>
> we are going to evaluate slurm as batch system for our computing
> farm[14k computing slots].
>
> I've done some tests using the prolog script and I've noticed that
>
> 1. when the "Prolog" script fails the host, where it failed, is flagged
>    as DOWN and the job will stack in PENDING status.
> 2. when the "PrologSlurmctld" script fails the job is CANCELLED.
>
> first of all, can someone confirm that this is the expected behavior ?
>
> Is there a way to configure slurm in order to automatically dispatch a
> job on a new host when the "Prolog " script fails ?
>
> unfortunately I didn't find any answer to my questions in the "Prolog
> and Epilog Scripts" section of the slurm.conf man page
>
> thanks in advance
>
> Alessandro
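
As an illustration of the override described in the reply above, a script could route its own failure handling through one of the two commands. This is only a minimal sketch, not SchedMD's code: the function name on_failure and the SCONTROL/SCANCEL override variables are hypothetical, and how the script's exit status interacts with the default handling should be checked against the slurm.conf man page for the installed version.

```shell
#!/bin/sh
# Sketch only: helper for a Prolog or PrologSlurmctld script.
# SCONTROL and SCANCEL are hypothetical override variables (handy for
# dry runs); by default they resolve to the real Slurm commands.
SCONTROL="${SCONTROL:-scontrol}"
SCANCEL="${SCANCEL:-scancel}"

# on_failure ACTION
#   requeue -> runs: scontrol requeue $SLURM_JOBID
#   cancel  -> runs: scancel $SLURM_JOBID
# Any other ACTION prints an error and returns status 2.
on_failure() {
    case "$1" in
        requeue) "$SCONTROL" requeue "$SLURM_JOBID" ;;
        cancel)  "$SCANCEL" "$SLURM_JOBID" ;;
        *)       echo "on_failure: unknown action '$1'" >&2
                 return 2 ;;
    esac
}
```

A script would then call, for example, "on_failure requeue" when its own setup step fails, instead of relying on the default behavior.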
