On 23.09.2011 at 23:33, Marty Dippel wrote:

> Thanks, Ian!
> 
> I take it that "alarm" usually means something job-related (asking for
> more resources than available, for example) as opposed to something gone
> wrong in the queuing system per se.

No, it's a "problem" with the node. Please check the setting of:

$ qconf -sq all.q
...
load_thresholds       NONE

The default is np_load_avg=1.75, which is more or less useless nowadays. The
problem is that the load average also counts processes in state "D"
(uninterruptible sleep in the kernel, which usually means "waiting for disk").
So a load higher than the number of installed cores times 1.75 can still be
fine. [Originally the load was the length of the run queue, i.e. the number of
processes eligible to get some CPU cycles. As long as this number is lower than
the number of installed cores, all processes run at full speed (despite any
nice values set), as there is no one to be nice to. Only with more processes
than cores is there something to share.]
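
To make the arithmetic concrete: np_load_avg is the load average divided by the
number of processors on the host, and the queue instance goes into alarm once
that value reaches the threshold. What follows is only a minimal sketch of that
comparison, not SGE's own code. The function name is made up, and the core
counts are assumptions, since the thread does not say how many cores the node
in question has.

```shell
#!/bin/sh
# Sketch only: mimics the comparison behind "np_load_avg=1.75".
# np_load_avg = load_avg / number of processors on the host.
# np_load_alarm is a hypothetical helper, not an SGE command.
np_load_alarm() {
    load_avg=$1              # e.g. the load_avg column of "qstat -f"
    cores=$2                 # number of cores in the node (assumed here)
    threshold=${3:-1.75}     # default threshold mentioned above
    awk -v l="$load_avg" -v c="$cores" -v t="$threshold" 'BEGIN {
        np = l / c
        state = (np >= t) ? "alarm" : "ok"
        printf "np_load_avg=%.4f %s\n", np, state
    }'
}

np_load_alarm 4.03 2    # load 4.03 on an assumed 2-core node
np_load_alarm 4.03 4    # the same load on an assumed 4-core node
```

With the load of 4.03 from the original question, a 2-core node would be over
the default threshold, while a 4-core node would not.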

Especially if you have defined slots = cores per machine, you don't need any
load_thresholds.

I think it was invented at a time when you had big SMP machines with 256 cores
(which is only one node to SGE) and oversubscribed the node on purpose (as you
were aware that not all parallel applications scale linearly, so they leave
some CPU cycles idle). So maybe it was fine to define 512 slots on such a
machine. Only when the normalized load (np_load_avg) passed 1.75 did you get
an alarm state.

The alarm state is nothing to get a heart attack over. It only means that a
threshold in load_thresholds was exceeded and therefore no more jobs will be
scheduled to this machine until the reason for the alarm vanishes again.

I use it in combination with a load sensor from the Howto page to check whether
the local scratch space has filled up on a node, as this could result in a
black hole (job starts, crashes, next job starts, crashes, ...).
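
For readers who have not seen a load sensor: it is a small program started by
the execution daemon that prints a "begin" / "host:name:value" / "end" block
each time it receives a line on stdin, and exits on "quit". The following is
only a rough sketch of that idea, not the script from the Howto page. The
complex name scratch_free_kb and the default directory /tmp are placeholders;
such a complex would first have to be defined (qconf -mc) before it can be
used in load_thresholds.

```shell
#!/bin/sh
# Sketch of a load sensor that reports free scratch space in KB.
# Assumptions (not from the thread): the complex is called
# "scratch_free_kb" and the scratch directory defaults to /tmp.
SCRATCH_DIR=${SCRATCH_DIR:-/tmp}
HOST=$(uname -n)    # must match the host name as SGE knows it

report_scratch() {
    # POSIX "df -P -k": 4th column of the 2nd line is the available space.
    free_kb=$(df -P -k "$SCRATCH_DIR" 2>/dev/null | awk 'NR==2 {print $4}')
    echo "begin"
    echo "$HOST:scratch_free_kb:${free_kb:-0}"
    echo "end"
}

scratch_sensor() {
    # One report per line received; the word "quit" ends the loop.
    while read -r line; do
        [ "$line" = "quit" ] && break
        report_scratch
    done
}

# Demo: simulate one request followed by the shutdown request.
printf '\nquit\n' | scratch_sensor
```

When installed as a real load sensor, the last line would be a plain call to
scratch_sensor so that it reads the requests sent by the execution daemon.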

> Anyway, I'll try "-explain" - thanks!!

Looks like it can even be used without a JOBID.

-- Reuti


> Marty
> 
> 
> 
> On 9/23/11 4:22 PM, Ian Kaufman wrote:
>> On Fri, Sep 23, 2011 at 1:55 PM, Marty Dippel <[email protected]> wrote:
>>> SGE Newbie question-
>>> 
>>> When I "qstat -f" a few of the nodes return an "a" state, which I
>>> believe means the node is in alarm.
>>> 
>>> 
>>> queuename                      qtype used/tot. load_avg arch          states
>>> ----------------------------------------------------------------------------
>>> [email protected]        BIP   2/2       4.03     lx26-amd64    a
>>> 35329 0.50894 finer3a    abaezgua     r     09/23/2011 11:08:04     2
>>> 
>>> ----------------------------------------------------------------------------
>>> 
>>> 
>>> 1. What's the best way for me to discover the cause of the alarm state?
>> 
>> qstat -explain a JOBID
>> 
>>> 
>>> 2. Once a node is in alarm, will it reset by itself when the condition
>>> is corrected or will it require human intervention to clear this state?
>> 
>> Depends on whether the node can clear out the job without human
>> intervention. Usually, it's best to intervene.
>> 
>> Ian
>> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

