Hi There,
I've recently set up a resume script that:
- Boots the nodes
- Configures the system
- Checks that interconnect speeds, etc. are as expected
If any of the above checks fail, the node is set to state 'FAIL' with an
appropriate reason, and the job that was allocated to those nodes is
requeued (we have multiple small clusters, so a requeue is necessary).
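For context, the failure path of the resume script looks roughly like this (a sketch, not the real script; the function name and the SCONTROL override are illustrative):

```shell
#!/bin/bash
# Sketch of the resume script's failure handling (illustrative only).
SCONTROL=${SCONTROL:-scontrol}   # overridable so the sketch can be dry-run

fail_and_requeue() {
    local node="$1" jobid="$2" reason="$3"
    # Set the node to FAIL with a reason (visible in 'sinfo -R') ...
    "$SCONTROL" update NodeName="$node" State=FAIL Reason="$reason"
    # ... and requeue the job so it can start on a healthy set of nodes
    "$SCONTROL" requeue "$jobid"
}
```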
The issue I'm having is that jobs submitted to these machines launch
their serial processes before all nodes are up. If a node fails to come
up, the job won't actually get requeued and will die.
I came across the scontrol wait_job argument online, and this seemed like
the solution to my problem. However, the scontrol wait_job command only
seems to work when I put it in the job script; if it's in the
PrologSlurmctld, the job just hangs in 'CONFIGURING' indefinitely and
never runs (even though all nodes are up and responding). A user who had
a similar issue (and sadly, no response) can be found here:
https://groups.google.com/forum/#!topic/slurm-devel/Wxe0ZitpHuw
I don't think there's anything wrong with the prolog script itself; it's
pretty simple and can be seen below:
#!/bin/bash
LOG="/tmp/prolog_${SLURM_JOBID}.txt"
echo "Running prolog - $(date) - $SLURM_JOBID" >> "$LOG"
scontrol -v wait_job "$SLURM_JOBID" >> "$LOG"
echo "End of prolog - $(date) - $SLURM_JOBID" >> "$LOG"
I did try setting PrologEpilogTimeout to a few minutes (instead of the
default), but all this does is mean that the scontrol command is killed
after 5 minutes... It's as useful as having my prolog script be nothing
but a sleep statement!
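For reference, the relevant slurm.conf lines look something like this (the script path and timeout value here are illustrative, not my exact config):

```
# slurm.conf (excerpt; path and value illustrative)
PrologSlurmctld=/etc/slurm/prolog_slurmctld.sh
PrologEpilogTimeout=300   # seconds before slurmctld kills the prolog
```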
Any advice would be greatly appreciated!
Cheers,
Stu
--
Stuart Franks
Junior IT Technician
T: +44 (0) 1280 840 316  W: www.totalsimulation.co.uk
E: stu...@totalsim.co.uk  T: @totalsimltd
TotalSim Ltd, Top Station Road, Brackley, UK, NN13 7UG