One more thing. While Slurm repeatedly sends SIGKILL to the job (which
the kernel may be unable to act on, e.g. for a process stuck in
uninterruptible sleep), Slurm can be configured to take special
action for non-killable processes. See the UnkillableStepProgram and
UnkillableStepTimeout configuration parameters described in the
slurm.conf man page.
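As a sketch, the relevant slurm.conf entries look something like the
following (the parameter names are real; the script path and timeout
value are illustrative examples, not recommendations):

```
# slurm.conf (illustrative values; the script path is a made-up example)

# Program to run when a job step cannot be killed within the timeout;
# it can notify an admin, collect diagnostics, or drain the node.
UnkillableStepProgram=/usr/local/sbin/notify_unkillable.sh

# Seconds to wait after SIGKILL before declaring the step unkillable
# and invoking UnkillableStepProgram (the default is 60 seconds).
UnkillableStepTimeout=120
```

Both parameters take effect after the configuration is reread (e.g.
via "scontrol reconfigure") or the daemons are restarted.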
Quoting Moe Jette <[email protected]>:
Slurm can't kill the process, so it does not reallocate those resources. See:
http://slurm.schedmd.com/troubleshoot.html#completing
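For reference, manually recovering a node stuck in a completing state
typically looks something like this (scontrol and slurmd are real
Slurm components; the node name and init-system commands are
placeholders for your site's equivalents):

```shell
# On the affected compute node: restart the slurmd daemon
# (systemd-style invocation; adjust for your init system).
systemctl restart slurmd

# From a management host: return the node to service.
# RESUME clears the drained/down state without affecting running jobs.
scontrol update NodeName=node01 State=RESUME
```

If the node still will not leave the completing state, the underlying
unkillable process usually has to be cleared first (often only
possible via a reboot if it is stuck in the kernel).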
Quoting Michael Colonno <[email protected]>:
Hi ~
I've run into this issue with several different versions
(currently 14.0.3) and I've never been able to find a root cause:
Sometimes, usually when a job is canceled, the job(s) enter state
"CG" and the corresponding nodes enter state "comp" or oscillate
between "comp" and "comp*". The slurm logs show a cancellation of a
job but no other errors or issues. This zombie state persists
indefinitely. An admin has to either manually restart the slurm
process on the affected nodes and set their state to idle to bring
them back or, in some cases, force-kill the process ID to stop the
slurm process. Changing the timeout setting in the config file does
not seem to have any effect. I am planning to update to the latest
version, but is there anything I can do to prevent or circumvent
this?
Thanks,
~Mike C.
--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support