Thanks - I read that section of the documentation. Unfortunately there
doesn't seem to be a way to perform this cycle automatically. I was hoping
there was a setting or option of the form "kill the process no matter what" or
at least a short timeout which does so.
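Failing a built-in option, the manual cycle could presumably be scripted and run from cron. Below is a rough, untested sketch of that idea, not anything Slurm provides. The function names and the "hung_completing" reason string are made up for illustration; the sinfo and scontrol invocations are the standard ones (down, then resume). Note the assumptions: it does not check how long a node has been completing, so it would also catch nodes that are finishing normally, and it does not kill the hung process itself; it only returns the node to service.

```python
import subprocess

# States reported by `sinfo -o %T` for nodes that are completing; the
# trailing "*" marks a node that is not responding.
STUCK_STATES = {"completing", "completing*"}


def stuck_nodes(sinfo_output):
    """Return unique node names in a completing state from 'name state' lines."""
    nodes = []
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1] in STUCK_STATES and fields[0] not in nodes:
            nodes.append(fields[0])
    return nodes


def recycle_commands(node, reason="hung_completing"):
    """Build the two scontrol calls that mark a node down and then resume it."""
    return [
        ["scontrol", "update", f"NodeName={node}", "State=DOWN", f"Reason={reason}"],
        ["scontrol", "update", f"NodeName={node}", "State=RESUME"],
    ]


def recycle_stuck_nodes():
    """Query sinfo and recycle every node currently in a completing state."""
    out = subprocess.run(
        ["sinfo", "-h", "-N", "-o", "%N %T"],
        check=True, capture_output=True, text=True,
    ).stdout
    for node in stuck_nodes(out):
        for cmd in recycle_commands(node):
            subprocess.run(cmd, check=True)
```

Calling recycle_stuck_nodes() as a Slurm admin user would run the whole cycle; the two helper functions are kept separate so the parsing can be checked without a live cluster.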
Thanks,
~Mike C.
-----Original Message-----
From: Moe Jette [mailto:[email protected]]
Sent: Thursday, April 02, 2015 3:50 PM
To: slurm-dev
Subject: [slurm-dev] Re: "CG" state forever?
Slurm can't kill the process, so it does not reallocate those resources. See:
http://slurm.schedmd.com/troubleshoot.html#completing
Quoting Michael Colonno <[email protected]>:
> Hi ~
>
> I've run into this issue with several different versions (currently
> 14.0.3) and I've never been able to find a root cause: Sometimes,
> usually when a job is canceled, the job(s) enter state "CG" and the
> corresponding nodes enter state "comp" or oscillate between "comp"
> and "comp*". The slurm logs show a cancelation of a job but no other
> errors or issues. This zombie state persists indefinitely. An admin
> has to either manually restart the slurm process on the affected nodes
> and set their state to idle to bring them back or, in some cases,
> force-kill the process ID to stop the slurm process. Changing the
> timeout setting in the config file does not seem to have any effect. I
> am planning on updating versions to the latest but is there anything I
> can do to prevent or circumvent this?
>
> Thanks,
> ~Mike C.
--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support