Hi ~

        I've run into this issue with several different versions (currently 
14.0.3) and have never been able to find a root cause. Sometimes, usually when 
a job is canceled, the job(s) enter state "CG" and the corresponding nodes 
enter state "comp" or oscillate between "comp" and "comp*". The slurm logs show 
the cancellation of the job but no other errors or issues. This zombie state 
persists indefinitely. An admin has to either manually restart the slurm 
process on the affected nodes and set their state to idle to bring them back 
or, in some cases, force-kill the process ID to stop the slurm process. 
Changing the timeout setting in the config file does not seem to have any 
effect. I am planning to update to the latest version, but is there anything 
I can do to prevent or work around this in the meantime? 
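
For reference, the manual recovery described above can be sketched roughly as 
follows. This is only an illustration under assumptions, not a tested 
procedure: it assumes the stuck jobs are listed with 
`squeue -h -t CG -o "%i %N"` (job ID and node list), and it uses the 
down-then-resume sequence with `scontrol update` rather than setting the state 
to idle directly. The function names, the Reason string, and the sample input 
are all hypothetical, and the script only prints the commands instead of 
running them. Restarting the slurm daemon on the nodes is not covered here.

```python
# Sketch only: build (but do not run) scontrol commands for nodes that
# still hold jobs in the CG (completing) state.
# Assumed input: the text printed by `squeue -h -t CG -o "%i %N"`,
# i.e. one "<jobid> <nodelist>" pair per line.


def parse_completing(squeue_output):
    """Return {job_id: nodelist} for every well-formed line of the input."""
    jobs = {}
    for line in squeue_output.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            job_id, nodelist = parts
            jobs[job_id] = nodelist
    return jobs


def recovery_commands(jobs, reason="hung_completing"):
    """Return a down-then-resume pair of scontrol commands per node list.

    The Reason text is a hypothetical placeholder; any short string works.
    """
    commands = []
    for nodelist in sorted(set(jobs.values())):
        commands.append(
            ["scontrol", "update", f"NodeName={nodelist}",
             "State=DOWN", f"Reason={reason}"]
        )
        commands.append(
            ["scontrol", "update", f"NodeName={nodelist}", "State=RESUME"]
        )
    return commands


# Hypothetical sample input, standing in for real squeue output.
sample = "1234 node01\n1235 node[02-03]\n"
for command in recovery_commands(parse_completing(sample)):
    print(" ".join(command))
```

Printing the commands rather than executing them keeps this a dry run, so an 
admin can review each line before applying it to a production cluster. 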

        Thanks,
        ~Mike C. 
