Hi Yi,

You should generate the nhc.conf file only once for each type of node, using nhc-genconf as an initial starting point, not at every node reinstallation.

You must then tailor nhc.conf to do only the checks that you find relevant for the given nodes. See an example in https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-health-check
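
Roughly, the workflow looks like this (if I remember correctly, nhc-genconf writes to /etc/nhc/nhc.conf.auto by default, and -H sets the hostname pattern used in the generated lines; check your version's man page):

  # Generate an initial config on one representative node of each type
  nhc-genconf -H '*'
  # Review the generated checks and copy only the relevant ones into
  # the real config file
  vi /etc/nhc/nhc.conf.auto /etc/nhc/nhc.conf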

We create a global nhc.conf file with node name patterns selecting which tests to run on which nodes. The /etc/nhc/nhc.conf file is then distributed by rsync, just like we distribute our Slurm config files and other files.
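
Something along these lines (the node names below are just placeholders for your own node list):

  # Push the common NHC config from the admin host to all compute nodes
  for node in node001 node002 node003; do
      rsync -a /etc/nhc/nhc.conf ${node}:/etc/nhc/
  done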

/Ole

On 02/24/2017 12:44 AM, Yi Sun wrote:
Thanks very much for your reply.

My NHC version is 1.4.2, and it is the correct RPM for my CentOS 7.2.

I have the following lines in my nhc.conf, but there's nothing related
to /run/user/1000, so I am a bit confused. If I set the node back to idle,
it keeps coming back to drained after some time.

Thanks,
Yi

 testnode1 || check_fs_mount_rw -t "ext4" -s "/dev/vda1" -f "/"
 testnode1 || check_fs_mount_rw -t "sysfs" -s "sysfs" -f "/sys"
 testnode1 || check_fs_mount_rw -t "proc" -s "proc" -f "/proc"
 testnode1 || check_fs_mount_rw -t "devtmpfs" -s "devtmpfs" -f "/dev"
 devlogin0 || check_fs_mount_rw -t "securityfs" -s "securityfs" -f "/sys/kernel/security"
 testnode1 || check_fs_mount_rw -t "tmpfs" -s "tmpfs" -f "/dev/shm"
 testnode1 || check_fs_mount_rw -t "devpts" -s "devpts" -f "/dev/pts"
 testnode1 || check_fs_mount_rw -t "tmpfs" -s "tmpfs" -f "/run"
 testnode1 || check_fs_mount_ro -t "tmpfs" -s "tmpfs" -f "/sys/fs/cgroup"
 testnode1 || check_fs_mount_rw -t "pstore" -s "pstore" -f "/sys/fs/pstore"
 testnode1 || check_fs_mount_rw -t "configfs" -s "configfs" -f "/sys/kernel/config"
 testnode1 || check_fs_mount_rw -t "debugfs" -s "debugfs" -f "/sys/kernel/debug"
 testnode1 || check_fs_mount_rw -t "hugetlbfs" -s "hugetlbfs" -f "/dev/hugepages"
 testnode1 || check_fs_mount_rw -t "mqueue" -s "mqueue" -f "/dev/mqueue"


On Thu, 2017-02-23 at 00:35 -0800, Ole Holm Nielsen wrote:
On 02/23/2017 02:24 AM, Yi Sun wrote:
Hi,
I'm trying to add new compute nodes to an existing Slurm cluster.
After this, I ran 'sinfo -R' on the server node; the newly added nodes
showed 'NHC:check_fs_mount' and their state was drained.

I then looked at the log; it says something about /run/user/1000 not being
mounted on these new nodes. I'm a bit new to this and I'm not sure what is
happening. If I simply run scontrol update and set these new nodes' state
to Resume (command shown below), the status goes back to idle. Do I need to
worry about this 'check_fs_mount' issue?
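
For reference, the resume command I mean is roughly this (node name taken
from the config lines above):

  # Clear the drained state so the node is scheduled again
  scontrol update NodeName=testnode1 State=RESUME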

You have wisely chosen to enable the Node Health Check (NHC) in Slurm.
For further information see my Wiki
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-health-check.
  Please make sure that you have installed the latest NHC version 1.4.2,
and that it's the correct RPM package for your Linux version.
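
For reference, the NHC hook in slurm.conf typically looks something like
this (the path is where the NHC RPM installs the script, and the interval
value is only an illustration; adjust both for your site):

  # slurm.conf excerpt (illustrative values)
  HealthCheckProgram=/usr/sbin/nhc
  HealthCheckInterval=300
  HealthCheckNodeState=ANY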

You do need to configure NHC appropriately for your servers, however, so
check the file /etc/nhc/nhc.conf.  Which 'check_fs_mount' lines are in
your nhc.conf?  Which Linux OS do you use?  Which NHC version do you use?

/Ole
