Dear Slurm community,

We are currently running Slurm version 18.08.4.

We have been experiencing an issue that causes every node a Slurm job ran on 
to "drain" once the job finishes.
From what I've seen, it appears there is a problem with how Slurm cleans up 
the job when it tries to kill it with SIGKILL.

I've found this Slurm troubleshooting article 
(https://slurm.schedmd.com/troubleshoot.html#completing), which has a section 
titled "Jobs and nodes are stuck in COMPLETING state" that recommends 
increasing "UnkillableStepTimeout" in slurm.conf, but all that has done is 
prolong the time it takes for the job to time out.
The default for "UnkillableStepTimeout" is 60 seconds.
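For reference, the line we added to slurm.conf looks something like the 
following (the 120 here is just an example; we have tried a couple of 
different values):

UnkillableStepTimeout=120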

After the job completes, it stays in the CG (completing) state for those 60 
seconds, and then the nodes the job ran on go into drain status.
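In the meantime we have been bringing the drained nodes back by hand with 
something along these lines (node001 just as an example):

scontrol update NodeName=node001 State=RESUME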

On the headnode running slurmctld, I am seeing this in the log - 
/var/log/slurmctld:
--------------------------------------------------------------------------------------------------------------------------------------------
[2020-07-21T22:40:03.000] update_node: node node001 reason set to: Kill task 
failed
[2020-07-21T22:40:03.001] update_node: node node001 state set to DRAINING

On the compute node, I am seeing this in the log - /var/log/slurmd:
--------------------------------------------------------------------------------------------------------------------------------------------
[2020-07-21T22:38:33.110] [1485.batch] done with job
[2020-07-21T22:38:33.110] [1485.extern] Sent signal 18 to 1485.4294967295
[2020-07-21T22:38:33.111] [1485.extern] Sent signal 15 to 1485.4294967295
[2020-07-21T22:39:02.820] [1485.extern] Sent SIGKILL signal to 1485.4294967295
[2020-07-21T22:40:03.000] [1485.extern] error: *** EXTERN STEP FOR 1485 STEPD TERMINATED ON node001 AT 2020-07-21T22:40:02 DUE TO JOB NOT ENDING WITH SIGNALS ***


I've tried restarting the slurmd daemon on the compute nodes, and even 
completely rebooting a few compute nodes (node001, node002).
From what I've seen, we're experiencing this on all nodes in the cluster.
I've yet to restart the headnode because there are still active jobs on the 
system, and I don't want to interrupt those.
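In case it's relevant, the slurmd restarts were done through the normal 
systemd unit, i.e. something like:

systemctl restart slurmd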


Thank you for your time,
Ivan
