We're running slurm-17.11.12 on Bright Cluster 8.1, and node002 keeps going into a draining state:

```
$ sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      1   drng node002
```
```
$ sinfo -N -o "%.20N %.15C %.10t %.10m %.15P %.15G %.35E"
            NODELIST   CPUS(A/I/O/T)      STATE     MEMORY       PARTITION            GRES                              REASON
             node001       9/15/0/24        mix     191800           defq*           gpu:1                                none
             node002       1/0/23/24       drng     191800           defq*           gpu:1 gres/gpu count changed and jobs are
             node003       1/23/0/24        mix     191800           defq*           gpu:1                                none
```

None of the nodes has a separate slurm.conf file; it's all shared from the head node. What else could be causing this?

Here is the relevant excerpt from the slurmctld log:

```
[2020-03-13T07:14:28.590] gres/gpu: count changed for node node002 from 0 to 1
[2020-03-13T07:14:28.590] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2020-03-13T07:14:28.590] error: Node node001 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:14:28.590] error: Node node003 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:47:48.787] error: Node node001 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:47:48.787] error: Node node003 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:47:48.788] gres/gpu: count changed for node node002 from 0 to 1
[2020-03-13T07:47:48.788] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2020-03-13T08:21:08.057] error: Node node001 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T08:21:08.058] error: Node node003 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T08:21:08.058] gres/gpu: count changed for node node002 from 0 to 1
[2020-03-13T08:21:08.058] error: _slurm_rpc_node_registration node=node002: Invalid argument
```
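For reference, my understanding of the "gres/gpu: count changed for node node002 from 0 to 1" message is that slurmctld's recorded GPU count for the node (from the `Gres=` entry on its `NodeName` line in slurm.conf) disagrees with what node002's slurmd reports from its local gres.conf when it registers. Below is a rough sketch of how I believe the two files relate, assuming one GPU per node; the paths, CPU/memory values, and device file are my illustration, not copied from our cluster:

```
# slurm.conf (shared from the head node) -- controller's view of each node
GresTypes=gpu
NodeName=node[001-003] CPUs=24 RealMemory=191800 Gres=gpu:1 State=UNKNOWN

# gres.conf (read by slurmd on each compute node) -- what the node reports
NodeName=node[001-003] Name=gpu File=/dev/nvidia0
```

If the gres.conf that node002 reads differs from what the other nodes see (or the controller was started before the `Gres=gpu:1` entry existed), the registration would be rejected with exactly this "Invalid argument" error, as far as I can tell.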