Hi There,
I've recently set up a resume script that:
- Boots the nodes
- Configures the system
- Checks that interconnect speeds, etc. are as expected
If any of the above checks fail, the node is set to state 'FAIL' with an
appropriate reason, and the job that was allocated to those nodes is
requeued (we have multiple small clusters, so a requeue is necessary).
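For context, the failure path of the resume script looks roughly like this (a sketch, not the real script; the function name and the SCONTROL override are illustrative):

```shell
#!/bin/bash
# Sketch of the resume script's failure handling (illustrative only).
SCONTROL=${SCONTROL:-scontrol}   # overridable so the sketch can be dry-run

fail_and_requeue() {
    local node="$1" jobid="$2" reason="$3"
    # Set the node to FAIL with a reason (visible in 'sinfo -R') ...
    "$SCONTROL" update NodeName="$node" State=FAIL Reason="$reason"
    # ... and requeue the job so it can start on a healthy set of nodes
    "$SCONTROL" requeue "$jobid"
}
```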
The issue I'm having is that jobs submitted to these machines launch
their serial processes before all nodes are up. If a node fails to come
up, the job won't actually get requeued and will die.
I came across the scontrol wait_job argument online, and this seemed like
the solution to my problem. However, the scontrol wait_job command only
seems to work when I put it in the job script; if it's in the
PrologSlurmctld, the job just hangs in 'CONFIGURING' indefinitely and
never runs (even though all nodes are up and responding). A user who had
a similar issue (and sadly, no response) can be found here:
https://groups.google.com/forum/#!topic/slurm-devel/Wxe0ZitpHuw
I don't think there's anything wrong with the prolog script itself; it's
pretty simple and can be seen below:
#!/bin/bash
LOG="/tmp/prolog_${SLURM_JOBID}.txt"
echo "Running prolog - $(date) - $SLURM_JOBID" >> "$LOG"
scontrol -v wait_job "$SLURM_JOBID" >> "$LOG"
echo "End of prolog - $(date) - $SLURM_JOBID" >> "$LOG"
I did try setting PrologEpilogTimeout to a few minutes (instead of the
default), but all this does is mean that the scontrol command is killed
after 5 minutes... It's as useful as having my prolog script be nothing
but a sleep statement!
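For reference, the relevant slurm.conf lines look something like this (the script path and timeout value here are illustrative, not my exact config):

```
# slurm.conf (excerpt; path and value illustrative)
PrologSlurmctld=/etc/slurm/prolog_slurmctld.sh
PrologEpilogTimeout=300   # seconds before slurmctld kills the prolog
```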
Any advice would be greatly appreciated!
Cheers,
Stu
--
Stuart Franks
Junior IT Technician
T: +44 (0) 1280 840 316  W: www.totalsimulation.co.uk
E: stu...@totalsim.co.uk  T: @totalsimltd
TotalSim Ltd, Top Station Road, Brackley, UK, NN13 7UG