Fixing this in version 14.11 would introduce issues that would require
a new configuration parameter to fix. However, you could apply this
patch and change a #define as described in the commit to address the
issue:
https://github.com/SchedMD/slurm/commit/42dc54eac21c1f591c2d18da65e298199a11f752
Quoting [email protected]:
[email protected] (Pär Lindfors) writes:
We have just discovered a problem with 14.11.4 configured to run Prolog
at job allocation (PrologFlags=Alloc).
When a batch job starts the first time, the Prolog is executed on all
nodes as expected.
If the job is then requeued and restarted, the Prolog is only run on the
first node, which runs the batch script, and on any node in the job
allocation that was not allocated to the job the first time it ran.
Further testing shows that this bug does not depend on
PrologFlags=Alloc; the same thing happens without that config option.
When a job is restarted, Prolog is only run on the node running the
batch script, and on nodes that were not allocated to the job the last
time it ran.
On nodes that were allocated to the job the last time, slurmd does not
run Prolog before running the first job step; it does not run Prolog at
all.
I have not done any detailed analysis of this case, but I would guess
something similar is causing this.
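To make the pattern concrete, the observed behaviour can be modelled as
a small set computation. This is only an illustrative Python sketch of
the symptom described above, not Slurm code; the function names and the
node names (n1..n4) are made up for the example.

```python
def nodes_running_prolog(previous_alloc, new_alloc, batch_node):
    """Model of the observed (buggy) behaviour after a requeue.

    Prolog runs only on the batch-script node and on nodes that were
    not part of the job's previous allocation.
    """
    new_nodes = set(new_alloc) - set(previous_alloc)
    return {batch_node} | new_nodes


def nodes_missing_prolog(previous_alloc, new_alloc, batch_node):
    """Nodes in the new allocation on which Prolog is never run."""
    ran = nodes_running_prolog(previous_alloc, new_alloc, batch_node)
    return set(new_alloc) - ran


# Hypothetical allocations: the job first runs on n1-n3, then is
# requeued and restarted on n2-n4 with n2 running the batch script.
first_run = ["n1", "n2", "n3"]
second_run = ["n2", "n3", "n4"]

skipped = nodes_missing_prolog(first_run, second_run, batch_node="n2")
print(sorted(skipped))  # nodes reused from the first run, minus the batch node
```

In this model, a node is skipped exactly when it is reused from the
previous run and is not the batch-script node.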
Regards,
Pär Lindfors, NSC
--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support