Check for Ethernet problems. This happens often enough that I have the 
following definition in my .bashrc file to help track these down:

alias flaky_eth='su -c "ssh slurmctld-node grep responding /var/log/slurm/slurmctld.log"'

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
???
Sent: Tuesday, July 21, 2020 8:41 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] lots of job failed due to node failure

Hi all,
We run Slurm 19.05 on a cluster of about 1,000 nodes. Recently we found that lots 
of jobs failed due to node failure; checking slurmctld.log, we found nodes are set 
to the DOWN state and then resume quickly.
Some log excerpts:
[2020-07-20T00:21:23.306] error: Nodes j[1608,1802] not responding
[2020-07-20T00:22:27.486] error: Nodes j1608 not responding, setting DOWN
[2020-07-20T00:26:23.725] error: Nodes j1802 not responding
[2020-07-20T00:26:27.323] error: Nodes j1802 not responding, setting DOWN
[2020-07-20T00:26:46.602] Node j1608 now responding
[2020-07-20T00:26:49.449] Node j1802 now responding
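For log lines like the ones above, a quick way to see whether the same nodes keep flapping is to tally "not responding" events per node. A minimal sketch (the function name `count_flaps` and the log path are my own; the sed pattern assumes the message format shown in the excerpt, including bracketed ranges like `j[1608,1802]`, which are counted as a single entry):

```shell
#!/bin/sh
# Hypothetical helper: count "not responding" events per node name in a
# slurmctld.log (or an excerpt of it), most-affected nodes first.
# Assumes lines of the form:
#   [2020-07-20T00:22:27.486] error: Nodes j1608 not responding, setting DOWN
count_flaps() {
    grep 'not responding' "$1" \
      | sed 's/.*error: Nodes \([^ ]*\) not responding.*/\1/' \
      | sort | uniq -c | sort -rn
}
```

Usage would be something like `count_flaps /var/log/slurm/slurmctld.log`; nodes that dominate the output are the first candidates for a cabling or switch-port check.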

Has anyone hit this issue before?
Any suggestions would help.

Regards.
