Hi There,

I've recently set up a resume script that:
- Boots the nodes
- Configures the system
- Checks that interconnect speeds, etc. are as expected

If any of the above checks fail, the node is set to state 'FAIL' with an appropriate reason, and the job that was allocated to those nodes is requeued (we have multiple small clusters, so a requeue is necessary).
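For reference, the failure path looks roughly like this (a simplified sketch, not the actual script; `check_interconnect` stands in for the real tests):

```shell
#!/bin/bash
# Sketch of the resume script's failure handling for one node.
# check_interconnect is a placeholder for the real health checks.
NODE="$1"

if ! check_interconnect "$NODE"; then
    # Mark the node failed with a reason visible in sinfo
    scontrol update NodeName="$NODE" State=FAIL \
        Reason="interconnect speed below expected"

    # Requeue any job that was allocated this node
    for jobid in $(squeue --noheader --format=%A --nodelist="$NODE"); do
        scontrol requeue "$jobid"
    done
fi
```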

The issue I'm having is that jobs submitted to these machines launch their serial processes before all nodes are up. If a node fails to come up, the job won't actually get requeued and will die.

I came across the scontrol wait_job argument online, and this seemed like the solution to my problem. However, the scontrol wait_job command only seems to work when I put it in the job script; if it's in the PrologSlurmctld, the job just hangs in 'CONFIGURING' indefinitely and never runs (even though all nodes are up and responding). A user who had a similar issue (and, sadly, no response) can be found here: https://groups.google.com/forum/#!topic/slurm-devel/Wxe0ZitpHuw
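To be clear, this is the variant that does work for me, with wait_job at the top of the job script itself (the srun line is just a stand-in for the real application):

```shell
#!/bin/bash
#SBATCH --nodes=4

# Block until all allocated nodes are up before launching anything;
# bail out if the wait fails so nothing runs on a partial allocation.
scontrol wait_job "$SLURM_JOB_ID" || exit 1

srun ./my_app   # placeholder for the actual workload
```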

I don't think there's anything wrong with the prolog script, it's pretty simple and can be seen below:
#!/bin/bash
LOG="/tmp/prolog_${SLURM_JOBID}.txt"
echo "Running prolog - $(date) - $SLURM_JOBID" >> "$LOG"
scontrol -v wait_job "$SLURM_JOBID" >> "$LOG"
echo "End of prolog - $(date) - $SLURM_JOBID" >> "$LOG"

I did try setting PrologEpilogTimeout to a few minutes (instead of the default), but all this does is mean that the scontrol is killed after 5 minutes... It's about as useful as making my prolog script a plain sleep statement!

Any advice would be greatly appreciated!

Cheers,

Stu
--

Stuart Franks

Junior IT Technician

T: +44 (0) 1280 840 316  W: www.totalsimulation.co.uk

E: stu...@totalsim.co.uk  T: @totalsimltd

TotalSim Ltd, Top Station Road, Brackley, UK, NN13 7UG
