Thanks for the input guys!

We don’t even use Lustre filesystems… and it doesn’t appear to be I/O.

I ran iostat on both the head node and the compute node while the job was in CG 
status, and the %iowait value is 0.00 or 0.01:

$ iostat
Linux 3.10.0-957.el7.x86_64 (node002)   07/22/2020      _x86_64_        (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.01    0.00    0.01    0.00    0.00   99.98

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.82        14.09         2.39    1157160     196648

I also tried the following command to see if I could identify any processes in D 
state on the compute node, but it returned no results:
ps aux | awk '$8 ~ /D/  { print $0 }'


This one’s got me stumped…

Sorry, I’m not too familiar with epilog yet; do you have any examples of how I 
would use that to log the SIGKILL event?

Thanks again,
Ivan

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Paul 
Edmon
Sent: Thursday, July 23, 2020 7:19 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Nodes going into drain because of "Kill task failed"


Same here. Whenever we see a rash of "Kill task failed" errors, it is invariably 
symptomatic of one of our Lustre filesystems acting up or being saturated.

-Paul Edmon-
On 7/22/2020 3:21 PM, Ryan Cox wrote:
Angelos,

I'm glad you mentioned UnkillableStepProgram.  We meant to look at that a while 
ago but forgot about it.  That will be very useful for us as well, though the 
answer for us is pretty much always Lustre problems.

Ryan
On 7/22/20 1:02 PM, Angelos Ching wrote:
Agreed. You may also want to write a script that gathers the list of programs in 
"D state" (uninterruptible kernel wait) and prints their kernel stacks, and 
configure it as UnkillableStepProgram, so that you can capture the programs and 
the relevant system calls that caused the job to become unkillable / time out 
while exiting, for further troubleshooting.
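
Something along these lines might do it (just a rough, untested sketch; it assumes 
slurmd exports SLURM_JOB_ID to the program and runs it as root so it can read 
/proc/<pid>/stack, and the log path is only an example):

#!/bin/bash
# Sketch of an UnkillableStepProgram: log any D-state (uninterruptible sleep)
# processes and their kernel stacks so you can see what the step was blocked on.
# Adjust the log location to somewhere that exists on your nodes.
LOG=/var/log/slurm/unkillable-${SLURM_JOB_ID:-unknown}.log
{
    date
    echo "host=$(hostname) job=${SLURM_JOB_ID:-unknown}"
    # Processes currently in D state, with the kernel function they are waiting in
    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
    # Kernel stack of each D-state process
    for pid in $(ps -eo pid,stat | awk '$2 ~ /D/ {print $1}'); do
        echo "--- /proc/$pid/stack ---"
        cat /proc/"$pid"/stack 2>/dev/null
    done
} >> "$LOG" 2>&1

Then point slurm.conf at it with something like 
UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh (path is only an example), 
alongside your UnkillableStepTimeout.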

Regards,
Angelos
(Sent from mobile, please pardon me for typos and cursoriness.)


On 2020/07/23 at 0:41, Ryan Cox <ryan_...@byu.edu> wrote:
Ivan,

Are you having I/O slowness? That is the most common cause for us. If it's not 
that, you'll want to look through all the reasons that it takes a long time for 
a process to actually die after a SIGKILL because one of those is the likely 
cause. Typically it's because the process is waiting for an I/O syscall to 
return. Sometimes swap death is the culprit, but usually not at the scale that 
you stated. Maybe you could try reproducing the issue manually or putting 
something in epilog to see the state of the processes in the job's cgroup.
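
Something like this in the epilog might be enough (a rough sketch only; the cgroup 
path assumes cgroup v1 with proctrack/cgroup and may need adjusting to your layout, 
and it relies on Slurm setting SLURM_JOB_ID and SLURM_JOB_UID for the epilog):

#!/bin/bash
# Epilog sketch: record any processes still left in the job's cgroup at epilog time.
CG=/sys/fs/cgroup/freezer/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}
LOG=/var/log/slurm/epilog-job-${SLURM_JOB_ID:-unknown}.log
if [ -d "$CG" ]; then
    {
        date
        echo "leftover tasks under $CG:"
        # List every PID still attached to the job's cgroup (and its step sub-cgroups)
        for pid in $(cat "$CG"/cgroup.procs "$CG"/*/cgroup.procs 2>/dev/null); do
            ps -o pid,stat,wchan:32,cmd -p "$pid" 2>/dev/null
        done
    } >> "$LOG" 2>&1
fi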

Ryan
On 7/22/20 10:24 AM, Ivan Kovanda wrote:
Dear slurm community,

Currently running Slurm version 18.08.4

We have been experiencing an issue in which the nodes a Slurm job was submitted to 
end up in the "drain" state.
From what I've seen, it appears that there is a problem with how Slurm is 
cleaning up the job's processes with SIGKILL.

I've found this Slurm troubleshooting article 
(https://slurm.schedmd.com/troubleshoot.html#completing), which has a section 
titled "Jobs and nodes are stuck in COMPLETING state". It recommends increasing 
"UnkillableStepTimeout" in slurm.conf, but all that has done is prolong the time 
it takes for the job to time out.
The default "UnkillableStepTimeout" is 60 seconds.
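
For reference, the setting itself is just one line in slurm.conf (the value shown 
here is only an example, not necessarily what we settled on):

# Default is 60 seconds; raising it only delays the timeout if processes really cannot be killed
UnkillableStepTimeout=120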

After the job completes, it stays in CG (completing) status for those 60 
seconds, and then the nodes the job was submitted to go into drain status.

On the headnode running slurmctld, I am seeing this in the log - 
/var/log/slurmctld:
--------------------------------------------------------------------------------------------------------------------------------------------
[2020-07-21T22:40:03.000] update_node: node node001 reason set to: Kill task 
failed
[2020-07-21T22:40:03.001] update_node: node node001 state set to DRAINING

On the compute node, I am seeing this in the log - /var/log/slurmd
--------------------------------------------------------------------------------------------------------------------------------------------
[2020-07-21T22:38:33.110] [1485.batch] done with job
[2020-07-21T22:38:33.110] [1485.extern] Sent signal 18 to 1485.4294967295
[2020-07-21T22:38:33.111] [1485.extern] Sent signal 15 to 1485.4294967295
[2020-07-21T22:39:02.820] [1485.extern] Sent SIGKILL signal to 1485.4294967295
[2020-07-21T22:40:03.000] [1485.extern] error: *** EXTERN STEP FOR 1485 STEPD 
TERMINATED ON node001 AT 2020-07-21T22:40:02 DUE TO JOB NOT ENDING WITH SIGNALS 
***


I've tried restarting the slurmd daemon on the compute nodes, and even 
completely rebooting a few compute nodes (node001, node002).
From what I've seen, we're experiencing this on all nodes in the cluster.
I've yet to restart the head node because there are still active jobs on the 
system, and I don't want to interrupt those.


Thank you for your time,
Ivan

