Hi Mehmet,

Perhaps you need to configure NHC to use the short hostname; see the example at https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-health-check

/Ole

On 06/19/2017 05:09 PM, Belgin, Mehmet wrote:
Thank you Loris, it was my bad. I should have used the short hostname, which seems to be working for me as well:

$ sinfo -o '%t %E' -hn `hostname -s`
drain Testing
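
The difference between the two lookups can be sketched roughly as below. This is only an illustration, assuming the nodes are defined in Slurm under their short names; the helper names short_name and node_state and the example.org hostnames are hypothetical and do not come from NHC or Slurm.

```shell
#!/bin/sh
# Sketch: derive the short node name from a (possibly fully qualified)
# hostname, similar to what `hostname -s` typically returns.
short_name() {
    # Strip everything from the first dot onward:
    # node003.example.org -> node003
    printf '%s\n' "${1%%.*}"
}

# Print state and reason for a single node.
# Assumption: Slurm knows the node by its short name.
node_state() {
    sinfo -o '%t %E' -hn "$(short_name "$1")"
}

# Only query Slurm on machines where sinfo is actually installed.
if command -v sinfo >/dev/null 2>&1; then
    node_state "$(hostname)"
fi
```

With this, passing either the fully qualified or the short hostname to node_state should lead to the same sinfo query.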



On Jun 19, 2017, at 2:28 AM, Loris Bennett <[email protected]> wrote:


Hi Mehmet,

"Belgin, Mehmet" <[email protected]> writes:

I’m troubleshooting an issue that causes NHC to fail to offline a bad
node. The node offline script uses formatted “sinfo” to identify the
node status, which returns blank for some reason. Interestingly, sinfo
works without custom formatting.

Could this be due to a bug in the current version (17.02.4)? Would
someone mind trying the following commands in an older Slurm version
to compare the output?

[root@devel-vcomp1 nhc]# sinfo --version
slurm 17.02.4

[root@devel-vcomp1 nhc]# sinfo -o '%t %E' -hn `hostname`

(NOTHING!)

[root@devel-vcomp1 nhc]# sinfo -hn `hostname`
test up infinite 0 n/a
vtest* up infinite 0 n/a

(OK)

Thanks!

-Mehmet


Seems to work as expected with our version:

[root@node003 ~]# sinfo --version
slurm 16.05.10-2
[root@node003 ~]# sinfo -o '%t %E' -hn `hostname`
mix none
[root@node003 ~]# sinfo -hn `hostname`
test           up    3:00:00      0    n/a
main*          up 14-00:00:0      1    mix node003
gpu            up 14-00:00:0      0    n/a

HTH,

Loris

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin [email protected]
