One more thing. While Slurm repeatedly sends SIGKILL to the job (which
the kernel may be unable to act on, e.g. for a process stuck in
uninterruptible sleep), Slurm can be configured to take special
action for non-killable processes. See the UnkillableStepProgram and
UnkillableStepTimeout configuration parameters described in the
slurm.conf man page.
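As a sketch, the relevant slurm.conf entries look something like the
following (the parameter names are real; the script path and timeout
value are illustrative examples, not recommendations):

```
# slurm.conf (illustrative values; the script path is a made-up example)

# Program to run when a job step cannot be killed within the timeout;
# it can notify an admin, collect diagnostics, or drain the node.
UnkillableStepProgram=/usr/local/sbin/notify_unkillable.sh

# Seconds to wait after SIGKILL before declaring the step unkillable
# and invoking UnkillableStepProgram (the default is 60 seconds).
UnkillableStepTimeout=120
```

Both parameters take effect after the configuration is reread (e.g.
via "scontrol reconfigure") or the daemons are restarted.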
Quoting Moe Jette <[email protected]>:
Slurm can't kill the process, so it does not reallocate those resources. See:
http://slurm.schedmd.com/troubleshoot.html#completing
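For reference, manually recovering a node stuck in a completing state
typically looks something like this (scontrol and slurmd are real
Slurm components; the node name and init-system commands are
placeholders for your site's equivalents):

```shell
# On the affected compute node: restart the slurmd daemon
# (systemd-style invocation; adjust for your init system).
systemctl restart slurmd

# From a management host: return the node to service.
# RESUME clears the drained/down state without affecting running jobs.
scontrol update NodeName=node01 State=RESUME
```

If the node still will not leave the completing state, the underlying
unkillable process usually has to be cleared first (often only
possible via a reboot if it is stuck in the kernel).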
Quoting Michael Colonno <[email protected]>:
Hi ~
I've run into this issue with several different versions
(currently 14.0.3) and I've never been able to find a root cause:
Sometimes, usually when a job is canceled, the job(s) enter state
"CG" and the corresponding nodes enter state "comp" or oscillate
between "comp" and "comp*". The slurm logs show a cancellation of a
job but no other errors or issues. This zombie state persists
indefinitely. An admin has to either manually restart the slurm
process on the affected nodes and set their state to idle to bring
them back or, in some cases, force-kill the process ID to stop the
slurm process. Changing the timeout setting in the config file does
not seem to have any effect. I am planning to update to the latest
version, but is there anything I can do to prevent or circumvent
this?
Thanks,
~Mike C.
--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support