On 09/03/2016 19:12, Ryan Novosielski wrote:
On Mar 2, 2016, at 2:51 PM, Kilian Cavalotti <[email protected]>
wrote:
On Wed, Mar 2, 2016 at 10:12 AM, <[email protected]> wrote:
We want to introduce a new behavior in the way slurmd uses the
HealthCheckProgram. The idea is to avoid a race condition between the first
HealthCheckProgram run and the node accepting jobs. The slurmd daemon will
initialize and then loop on HealthCheckProgram execution before registering
with slurmctld. It will stay in this loop until the HealthCheckProgram
returns successfully; until then the node remains DOWN.
Love the idea!
I do as well. I’m currently having a devil of a time getting SLURM to accept
jobs only /after/ GPFS is available. So far I’ve tried a number of
dependency-related tricks with systemd and still don't have it working
right. This would solve that and any other “not ready” problems.
That's exactly the purpose of the patch since we were facing the same
issue with IB and GPFS.
FYI it's been accepted (thank you Moe btw!) and will be available in
Slurm 16.05.0pre2:
https://github.com/SchedMD/slurm/commit/7fb0c9817abef04d324933e389fe274f20097075
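
For anyone who wants a picture of the startup behavior described at the top of
the thread, here is a minimal sketch in Python. It is not the code from the
commit (slurmd is written in C). The function names, the retry delay, and the
injectable sleep are all assumptions made for illustration. It only shows the
idea of looping on the health check until it exits with status 0, and
registering afterwards.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_health_check(command: Sequence[str]) -> bool:
    """Run the health check command; True means it exited with status 0."""
    return subprocess.run(command).returncode == 0


def wait_until_healthy(
    check: Callable[[], bool],
    retry_delay: float = 60.0,  # hypothetical delay, not a Slurm default
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until check() succeeds and return the number of attempts made.

    The caller would register with the controller only after this returns,
    so the node stays out of service while the check keeps failing.
    """
    attempts = 1
    while not check():
        sleep(retry_delay)  # wait before running the check again
        attempts += 1
    return attempts


# Example wiring (the script path below is a placeholder, not a real default):
#   wait_until_healthy(lambda: run_health_check(["/path/to/healthcheck"]))
#   ...then register the node with the controller.
```

Passing the check and the sleep function as parameters is only there to make
the loop easy to exercise without a real health check script.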