Hi ~
I've run into this issue with several different Slurm versions (currently
14.0.3) and have never been able to find a root cause. Sometimes, usually when
a job is canceled, the job(s) enter state "CG" and the corresponding nodes
enter state "comp" or oscillate between "comp" and "comp*". The slurm logs show
a cancelation of a job but no other errors or issues. This zombie state
persists indefinitely. An admin has to either manually restart the slurm
process on the affected nodes and set their state to idle to bring them back
or, in some cases, force-kill the slurm process by its PID.
Changing the timeout setting in the config file does not seem to have any
effect. I am planning to upgrade to the latest version, but is there anything
I can do in the meantime to prevent or work around this?
Thanks,
~Mike C.