On Wed, 2 Nov 2016 at 5:05pm, Reuti wrote:
Just for the record: to investigate this, I defined a second entry in
load_thresholds, besides the one under test, which always puts the queue
in alarm state. I used our tmpfree complex for it and entered a value
larger than the installed disk. This way, `qstat -explain a` will always
give an output, and even the values of other complexes whose thresholds
aren't exceeded are displayed. I got:
$ qstat -explain a -q serial@node29 -s r
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
serial@node29 B 0/0/16 15.75 lx24-em64t a
alarm hl:tmpfree=1842222120k load-threshold=2T
alarm hl:np_load_avg=0.492188 load-threshold=0.5
$ qstat -explain a -q serial@node29 -s r
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
serial@node29 B 0/0/16 15.75 lx24-em64t a
alarm hl:tmpfree=1842222120k load-threshold=2T
alarm hl:np_load_avg= 9.844 load-threshold=0.5
$ qstat -explain a -q serial@node29 -s r
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
serial@node29 B 0/0/16 15.76 lx24-em64t a
alarm hl:tmpfree=1842221988k load-threshold=2T
alarm hl:np_load_avg= 0.246 load-threshold=0.5
for load_scaling settings of NONE, 20, and 0.5, respectively, for
np_load_avg on the exechost. Looks fine. Hence your np_load_avg=2 should
have worked.
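As a sanity check on the three readings above, here is a minimal sketch of the arithmetic, assuming load_scaling simply multiplies the raw load value by the configured factor (and that NONE means no scaling). The function name scaled_load is made up for illustration, and the raw value is taken from the first run; the load drifted slightly between runs, so the third reading is only approximately reproduced.

```python
# Raw np_load_avg as reported in the first run (load_scaling NONE).
RAW_NP_LOAD_AVG = 0.492188


def scaled_load(raw, factor=None):
    """Apply a load_scaling factor to a raw load value.

    Assumption: scaling is a plain multiplication, and None stands
    for a load_scaling setting of NONE (value left unchanged).
    """
    return raw if factor is None else raw * factor


if __name__ == "__main__":
    for label, factor in (("NONE", None), ("20", 20.0), ("0.5", 0.5)):
        value = scaled_load(RAW_NP_LOAD_AVG, factor)
        print(f"load_scaling {label:>4}: np_load_avg = {value:.3f}")
```

The scaled values line up with the 9.844 and 0.246 shown by qstat, which is why the scaling itself looks fine.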
The plot thickens. After doing testing similar to yours, I think this is
a display bug in qhost. Here are 2 configurations that both create an
alarm state, but in one of them the alarm doesn't show up in the output
of 'qhost':
Config 1:
$ qconf -sq long.q
load_thresholds np_load_avg=0.5
$ qconf -se msg-id1
load_scaling NONE
$ qhost -q -h msg-id1
HOSTNAME ARCH NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO SWAPUS
----------------------------------------------------------------------------------------------
msg-id1 lx-amd64 48 2 24 48 24.63 251.6G 2.2G 4.0G 0.0
member.q BP 0/24/24
short.q BP 0/0/24
long.q BP 0/0/24 a
$ qstat -explain a -q long.q@msg-id1 -s r
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
lon...@msg-id1.ic.ucsf.edu BP 0/0/24 24.62 lx-amd64 a
alarm hl:np_load_avg=0.512917 load-threshold=0.5
Config 2:
$ qconf -sq long.q
load_thresholds np_load_avg=0.9
$ qconf -se msg-id1
load_scaling np_load_avg=2.000000
$ qhost -q -h msg-id1
HOSTNAME ARCH NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO SWAPUS
----------------------------------------------------------------------------------------------
msg-id1 lx-amd64 48 2 24 48 24.58 251.6G 2.2G 4.0G 0.0
member.q BP 0/24/24
short.q BP 0/0/24
long.q BP 0/0/24
$ qstat -explain a -q long.q@msg-id1 -s r
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
lon...@msg-id1.ic.ucsf.edu BP 0/0/24 24.58 lx-amd64 a
alarm hl:np_load_avg= 1.024 load-threshold=0.9
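The np_load_avg values that qstat reports in both configs can be reproduced from the qhost numbers. This is only a sketch under assumptions: that np_load_avg is the LOAD column divided by NCPU (48 here), that load_scaling multiplies the result, and that the threshold counts as exceeded when the value is greater than or equal to it. The helper names are invented for illustration.

```python
def np_load_avg(load_avg, nproc, scaling=1.0):
    """Normalized load: load average per processor, times the scaling factor.

    Assumption: this mirrors how the reported value is derived; it is
    inferred from the numbers in the transcripts, not from the source code.
    """
    return load_avg / nproc * scaling


def in_alarm(value, threshold):
    """Assumed comparison: alarm when the value reaches the threshold."""
    return value >= threshold


# (label, load_avg from qstat, NCPU from qhost, load_scaling, load_thresholds)
CONFIGS = [
    ("Config 1", 24.62, 48, 1.0, 0.5),
    ("Config 2", 24.58, 48, 2.0, 0.9),
]

if __name__ == "__main__":
    for label, load, ncpu, scaling, threshold in CONFIGS:
        value = np_load_avg(load, ncpu, scaling)
        state = "alarm" if in_alarm(value, threshold) else "ok"
        print(f"{label}: np_load_avg={value:.6f} threshold={threshold} -> {state}")
```

Both configs come out in alarm state this way, matching what qstat shows, so the missing 'a' in the qhost output for Config 2 is not explained by the numbers.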
In both configs, long.q correctly refuses to accept jobs. But in Config 2
qhost doesn't show the alarm state, and that display error is sure to
confuse users as to why the queue is closed. I'm going to stick with the
previous solution, but I'll file a bug to try to get things fixed up.
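Until that is fixed, one workaround would be to check the `qstat -explain a` output rather than qhost. Below is a rough sketch of a parser for the alarm lines, assuming they always look like the ones in the transcripts above; the regular expression and the parse_alarms name are my own, not part of Grid Engine.

```python
import re

# Matches lines such as:
#   alarm hl:np_load_avg= 1.024 load-threshold=0.9
# Assumption: the layout is the same as in the transcripts in this thread.
ALARM_RE = re.compile(
    r"alarm\s+\w+:(?P<name>\w+)=\s*(?P<value>\S+)\s+load-threshold=(?P<threshold>\S+)"
)


def parse_alarms(qstat_output):
    """Return (complex, value, threshold) string tuples for each alarm line."""
    alarms = []
    for line in qstat_output.splitlines():
        match = ALARM_RE.search(line)
        if match:
            alarms.append(
                (match.group("name"), match.group("value"), match.group("threshold"))
            )
    return alarms


# Sample text copied from the Config 2 output above.
SAMPLE = """\
queuename qtype resv/used/tot. load_avg arch states
alarm hl:np_load_avg= 1.024 load-threshold=0.9
"""

if __name__ == "__main__":
    for name, value, threshold in parse_alarms(SAMPLE):
        print(f"{name}: value={value} threshold={threshold}")
```

Values are kept as strings because some complexes carry unit suffixes (for example tmpfree=1842222120k against a threshold of 2T), and converting those is left out of this sketch.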
Thanks again for all your help, Reuti.
--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF