Reuti-
Thanks for that great explanation, exactly what I needed. No doubt the
load on those nodes was a bit high at the time. I'll see whether it still is,
going forward.
Thanks for taking the time to explain this!
Marty
On 09/23/11 17:05, Reuti wrote:
On 23.09.2011 at 23:33, Marty Dippel wrote:
Thanks, Ian!
I take it that "alarm" usually means something job-related (asking for
more resources than available, for example) as opposed to something gone
wrong in the queuing system per se.
No, it's a "problem" with the node. Please check the setting of:
$ qconf -sq all.q
...
load_thresholds NONE
The default is np_load_avg=1.75, which is more or less useless nowadays. The problem is that processes
in state "D" (uninterruptible kernel task, which usually points to "waiting for disk")
are also counted in the load. So a load higher than the number of installed cores times 1.75 can still be fine. [Originally the load was
the length of the process chain, i.e. the number of processes eligible to get some CPU cycles. As
long as this number is lower than the number of installed cores, all processes run at full speed
(regardless of any nice values set), as there is no one to be nice to. Only with more processes than cores
is there something to share.]
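To make the arithmetic concrete, here is a minimal sketch of the check described above. It assumes np_load_avg is the load average divided by the number of cores and that the alarm is raised once that value reaches the threshold. The function names are made up for illustration. The load of 4.03 is taken from the qstat output quoted further down in this thread; the core count of 2 is only an assumption based on the 2 slots shown there.

```python
def np_load_avg(load_avg, cores):
    """Load average normalized by the number of cores."""
    return load_avg / cores


def in_alarm(load_avg, cores, threshold=1.75):
    """True if the normalized load has reached the threshold.

    The ">=" comparison is an assumption about how the threshold is applied.
    """
    return np_load_avg(load_avg, cores) >= threshold


# Node from the qstat output in this thread, assuming it has 2 cores:
print(np_load_avg(4.03, 2))   # roughly 2.0, above 1.75
print(in_alarm(4.03, 2))      # alarm state
# The same load on a hypothetical 4-core node stays below the threshold:
print(in_alarm(4.03, 4))
```

Note that processes in state "D" raise load_avg without using any CPU, so a node can trip this check while its cores are partly idle.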
Especially if you have defined slots = cores per machine, you need no
load_thresholds setting.
I think it was invented at a time when you had big SMP machines with 256 cores
(which is only one node to SGE) and deliberately oversubscribed the node
(knowing that not all parallel applications scale linearly, so some CPU
cycles are left idle). So maybe it was fine to define 512 slots on such a
machine. Only when the load passed 1.75 did you get an alarm state.
The alarm state is nothing to have a heart attack over. It only means
that the load threshold was passed and therefore no more jobs will be scheduled
to this machine until the reason for the alarm vanishes again.
I use it in combination with a load sensor from the Howto page to check whether
the local scratch space has filled up on a node, as this could result in a
black hole (job starts, crashes, next job starts, crashes, ...).
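For illustration only, a load sensor of this kind could look roughly like the sketch below. It is not the script from the Howto page. It assumes the usual load sensor protocol: the sensor waits for a line on stdin, exits on "quit", and otherwise prints a report framed by "begin" and "end" with one host:name:value line. The complex name scratch_free, the path /scratch and the use of socket.gethostname() for the host name are all assumptions; the complex would also have to be defined in SGE before a threshold on it could be set.

```python
import shutil
import socket
import sys


def free_megabytes(path):
    """Free space on the filesystem holding `path`, in whole megabytes."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def run_sensor(inp=sys.stdin, out=sys.stdout, path="/scratch", host=None):
    """Answer each request line with one report until "quit" or EOF."""
    # Assumption: the local host name matches the name SGE knows the node by.
    host = host or socket.gethostname()
    while True:
        line = inp.readline()
        if not line or line.strip() == "quit":
            break
        out.write("begin\n")
        # "scratch_free" is a hypothetical complex name.
        out.write(f"{host}:scratch_free:{free_megabytes(path)}\n")
        out.write("end\n")
        out.flush()  # the report must not sit in a buffer
```

As a script, this would end with a call to run_sensor(); the streams are parameters only so the loop can be tried out without a running execd.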
Anyway, I'll try "-explain" - thanks!!
Looks like it can even be used without a JOBID.
-- Reuti
Marty
On 9/23/11 4:22 PM, Ian Kaufman wrote:
On Fri, Sep 23, 2011 at 1:55 PM, Marty Dippel<[email protected]> wrote:
SGE Newbie question-
When I "qstat -f" a few of the nodes return an "a" state, which I
believe means the node is in alarm.
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
[email protected]               BIP   2/2       4.03     lx26-amd64    a
  35329 0.50894 finer3a    abaezgua     r     09/23/2011 11:08:04     2
----------------------------------------------------------------------------
1. What's the best way for me to discover the cause of the alarm state?
qstat -explain a JOBID
2. Once a node is in alarm, will it reset by itself when the condition
is corrected or will it require human intervention to clear this state?
Depends on whether the node can clear out the job without human
intervention. Usually, it's best to intervene.
Ian
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users