Thanks - I read that section of the documentation. Unfortunately there 
doesn't seem to be a way to perform this cycle automatically. I was hoping 
there was a setting or option of the form "kill the process no matter what" or 
at least a short timeout which does so. 
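
Failing that, the cycle could presumably be scripted from outside Slurm. The following is only a rough sketch, not a tested procedure: it assumes `sinfo` and `scontrol` are on the PATH, that the caller has admin rights, and that marking a node DOWN and then RESUME is acceptable on the cluster (setting a node DOWN also terminates any other jobs running on it). The function names are hypothetical, and none of this kills the stuck process itself.

```python
import subprocess


def parse_node_list(sinfo_output):
    """Return unique node names from sinfo output (one name per line)."""
    nodes = []
    for line in sinfo_output.splitlines():
        name = line.strip()
        if name and name not in nodes:
            nodes.append(name)
    return nodes


def build_reset_commands(node, reason="stuck completing"):
    """Build the two scontrol invocations that cycle one node DOWN, then RESUME."""
    return [
        ["scontrol", "update", f"NodeName={node}", "State=DOWN", f"Reason={reason}"],
        ["scontrol", "update", f"NodeName={node}", "State=RESUME"],
    ]


def reset_completing_nodes(dry_run=True):
    """List nodes in the completing state and cycle each one.

    With dry_run=True the commands are only printed, not executed.
    """
    # -h: no header, -N: one node per line, -t: filter by state, -o %N: node name
    result = subprocess.run(
        ["sinfo", "-h", "-N", "-t", "completing", "-o", "%N"],
        capture_output=True, text=True, check=True,
    )
    for node in parse_node_list(result.stdout):
        for cmd in build_reset_commands(node):
            if dry_run:
                print(" ".join(cmd))
            else:
                subprocess.run(cmd, check=True)
```

Calling `reset_completing_nodes()` with the default `dry_run=True` only prints the `scontrol` commands, so they can be reviewed before anything is changed. A node that is briefly and legitimately completing would also be caught, so a real script would likely need to check how long the node has been in that state.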

        Thanks,
        ~Mike C. 

-----Original Message-----
From: Moe Jette [mailto:[email protected]] 
Sent: Thursday, April 02, 2015 3:50 PM
To: slurm-dev
Subject: [slurm-dev] Re: "CG" state forever?


Slurm can't kill the process, so it does not reallocate those resources. See:
http://slurm.schedmd.com/troubleshoot.html#completing


Quoting Michael Colonno <[email protected]>:
> Hi ~
>
>       I've run into this issue with several different versions (currently
> 14.0.3) and I've never been able to find a root cause: Sometimes, 
> usually when a job is canceled, the job(s) enter state "CG" and the 
> corresponding nodes enter state "comp" or oscillate between "comp"
> and "comp*". The slurm logs show a cancelation of a job but no other 
> errors or issues. This zombie state persists indefinitely. An admin 
> has to either manually restart the slurm process on the affected nodes 
> and set their state to idle to bring them back or, in some cases, 
> force-kill the process ID to stop the slurm process. Changing the 
> timeout setting in the config file does not seem to have any effect. I 
> am planning on updating versions to the latest but is there anything I 
> can do to prevent or circumvent this?
>
>       Thanks,
>       ~Mike C.


--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
