On 02/23/2017 02:24 AM, Yi Sun wrote:
Hi,
I'm trying to add new compute nodes to an existing Slurm cluster.
After adding them, I ran 'sinfo -R' on the server node; the newly added
nodes showed the reason 'NHC:check_fs_mount' and their state is drained.
I then looked at the log, which says that /run/user/1000 is not
mounted on these new nodes. I'm a bit new to this and I'm not sure what
is happening. If I simply run scontrol update and set the state of these
new nodes to Resume, they go back to idle. Do I need to worry
about this 'check_fs_mount' issue?
You have wisely chosen to enable the Node Health Check (NHC) in Slurm.
For further information see my Wiki
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-health-check.
Please make sure that you have installed the latest NHC version 1.4.2,
and that it's the correct RPM package for your Linux version.
You do need to configure NHC appropriately for your servers, however, so
check the file /etc/nhc/nhc.conf. Which 'check_fs_mount' lines are in
your nhc.conf? Which Linux OS do you use? Which NHC version do you use?
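
To answer the nhc.conf question, it may help to list only the active
(uncommented) lines that call a check_fs_mount variant. The following is
a minimal Python sketch of that; the function name find_checks is
hypothetical and not part of NHC, and the sketch simply matches the text
'check_fs_mount' rather than parsing NHC's configuration syntax. One
possible cause, not confirmed in this thread, is that /run/user/1000 is a
per-user tmpfs that systemd only creates while the user with UID 1000 is
logged in, so a check for it would fail on nodes where that user has no
session.

```python
def find_checks(conf_text, check_prefix="check_fs_mount"):
    """Return (line_number, line) pairs for active lines mentioning the check.

    Blank lines and lines starting with '#' are skipped. This is a plain
    substring match, not a parser for NHC's configuration format.
    """
    hits = []
    for number, raw in enumerate(conf_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if check_prefix in line:
            hits.append((number, line))
    return hits


# Example use on a compute node (the path is the one named above):
#   with open("/etc/nhc/nhc.conf") as handle:
#       for number, line in find_checks(handle.read()):
#           print(f"{number}: {line}")
```

Any line in the output that names /run/user/1000 is the one to look at
first.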
/Ole
--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark,