Slurm can't kill the process, so it does not reallocate those resources. See:
http://slurm.schedmd.com/troubleshoot.html#completing
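
As a rough sketch of the manual recovery that page describes, as far as I recall it (mark the stuck node DOWN, then RESUME it so the scheduler will use it again), the Python helper below only builds the two scontrol command lines and prints them. The function name, the node name "node01", and the reason string are hypothetical placeholders, and nothing is executed against the cluster.

```python
import shlex


def recovery_commands(node, reason="hung_completing"):
    """Build (but do not run) the scontrol calls that cycle a node DOWN, then RESUME.

    `node` and `reason` are placeholders supplied by the admin; this helper is
    an illustration, not part of Slurm.
    """
    return [
        ["scontrol", "update", f"NodeName={node}", "State=DOWN", f"Reason={reason}"],
        ["scontrol", "update", f"NodeName={node}", "State=RESUME"],
    ]


if __name__ == "__main__":
    # Print the commands so an admin can review them before running by hand.
    for argv in recovery_commands("node01"):
        print(shlex.join(argv))
```

Note that cycling the node state this way only returns the node to service; a process that cannot be killed (for example, one hung on I/O) stays in place until its underlying cause is fixed.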


Quoting Michael Colonno <[email protected]>:
Hi ~

I've run into this issue with several different versions (currently 14.0.3) and I've never been able to find a root cause: Sometimes, usually when a job is canceled, the job(s) enter state "CG" and the corresponding nodes enter state "comp" or oscillate between "comp" and "comp*". The Slurm logs show the cancellation of a job but no other errors or issues. This zombie state persists indefinitely. An admin has to either manually restart the Slurm process on the affected nodes and set their state to idle to bring them back or, in some cases, force-kill the process ID to stop the Slurm process. Changing the timeout setting in the config file does not seem to have any effect. I am planning on updating to the latest version, but is there anything I can do to prevent or circumvent this?

        Thanks,
        ~Mike C.


--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
