On 10.03.2011 at 17:59, Rayson Ho wrote:

> LSF uses "exit rate" for that, but in SGE the load sensor has no
> knowledge of jobs running & exiting.
> 
> The way to do it in SGE (Open Grid Scheduler & Son of Grid Engine,
> etc) is to record the exit rate in a job starter (aka starter method).
> 
> And if anyone has written one already, I would like to put it up on
> the Open Grid Scheduler howto page as it is a nice feature for users
> migrating from LSF.
> 
> The starter method should be really simple: just record the exit times
> of the last few jobs and calculate the exit rate. If the rate is too
> high, disable the host.
> 
> http://gridscheduler.sourceforge.net/htmlman/htmlman5/queue_conf.html
> 
> "starter_method"
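Such a starter method might be sketched roughly as follows. The log path, window size, threshold, and the qmod call are all illustrative assumptions, not a tested implementation:

```shell
#!/bin/sh
# Hypothetical starter method sketch: run the job, then record its exit
# time and check whether the last few jobs exited suspiciously fast.
# Log path, window size, and threshold are assumptions.

LOG=${EXIT_LOG:-/tmp/sge_exit_times}   # per-host exit-time log (assumed path)
WINDOW=5                               # number of recent exits to look at
MIN_SPAN=60                            # seconds; WINDOW exits faster than this is suspicious

# Append the given exit time and return 0 ("too fast") when the last
# $WINDOW exits span fewer than $MIN_SPAN seconds.
exit_rate_too_high() {
    now=$1
    echo "$now" >> "$LOG"
    tail -n "$WINDOW" "$LOG" > "$LOG.tmp" && mv "$LOG.tmp" "$LOG"
    if [ "$(wc -l < "$LOG")" -lt "$WINDOW" ]; then
        return 1                       # not enough data yet
    fi
    first=$(head -n 1 "$LOG")
    [ $((now - first)) -lt "$MIN_SPAN" ]
}

if [ $# -gt 0 ]; then
    "$@"                               # run the actual job command line
    status=$?
    if exit_rate_too_high "$(date +%s)"; then
        qmod -d "*@$(hostname)"        # disable all queue instances on this host
    fi
    exit $status
fi
```

The thresholds would of course need tuning per site; very short legitimate jobs can otherwise trip the check.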

Isn't it already too late once the "starter_method" is started? I mean, when no 
job information can be written (e.g. to the spool area), the starter method will 
never be executed, but the job is trashed anyway.

-- Reuti


> Rayson
> 
> 
> 
> On Thu, Mar 10, 2011 at 11:47 AM, Reuti <[email protected]> wrote:
>> well, the feature to use the hawking radiation to allow the jobs to pop up 
>> on other nodes needs precise alignment of the installation -  SCNR
>> 
>> There is a demo script to check the free space of e.g. /tmp here 
>> http://arc.liv.ac.uk/SGE/howto/loadsensor.html; then use "load_thresholds 
>> tmpfree=1G" in the queue definition, so that the queue instance is set to 
>> alarm state when the free space falls below that value.
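A cut-down sensor in the spirit of that demo script might look like this. The "tmpfree" complex name and the df parsing are assumptions (the complex would need to be defined via `qconf -mc`); a real sensor should follow the protocol described in the linked howto:

```shell
#!/bin/sh
# Minimal load sensor sketch reporting free space in /tmp. It assumes a
# MEMORY-type complex named "tmpfree" exists. sge_execd sends a line to
# request a report and the string "quit" to shut the sensor down.

HOST=$(hostname)

# Print one report in the load sensor protocol: begin/end markers
# around host:complex:value lines.
report_tmpfree() {
    # free kilobytes on the /tmp filesystem (POSIX df output, column 4)
    free_kb=$(df -Pk /tmp | awk 'NR==2 {print $4}')
    echo "begin"
    echo "$HOST:tmpfree:${free_kb}K"
    echo "end"
}

while read -r line; do
    [ "$line" = "quit" ] && exit 0
    report_tmpfree
done
```

The same loop could just as well report a boolean "all disks fine" complex instead of a size; only the report line changes.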
>> 
>> A load sensor can also deliver a boolean value, hence checking locally 
>> something like "all disks fine" and using this as a "load_threshold" can 
>> also be a solution. How to check something is of course specific to your 
>> node setup.
>> 
>> The last necessary piece would be to inform the admin: this could be done 
>> by the load sensor too, but as the node is known not to be in a proper 
>> state, I wouldn't recommend this. Better might be a cron job on the 
>> qmaster machine checking `qstat -explain a -qs a -u foobar` *) to look 
>> for exceeded load thresholds.
>> 
>> -- Reuti
>> 
>> *) There is no "show no jobs at all" switch to `qstat`, so using an 
>> unknown user "foobar" will help. And on the other hand, there is no 
>> "load_threshold" in the exechost definition.
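The qmaster-side cron job could look roughly like this. The admin address and the use of mail(1) are assumptions, and qstat is wrapped in a small function purely for clarity:

```shell
#!/bin/sh
# Hypothetical qmaster cron job: mail the admin when `qstat` reports
# queue instances in alarm state. Address and mail command are assumed;
# "foobar" is a deliberately unknown user so that no job lines appear
# in the output.

ADMIN="ge-admin@example.com"    # assumed admin address

# Print the alarm report, or nothing when all queue instances are fine.
check_alarms() {
    qstat -explain a -qs a -u foobar 2>/dev/null
}

report=$(check_alarms)
if [ -n "$report" ]; then
    printf '%s\n' "$report" | mail -s "SGE: load threshold exceeded" "$ADMIN"
fi
```

Run from the qmaster's crontab, e.g. every five minutes, this stays out of the failing node's way entirely.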
>> 
>> 
>>> -Ed
>>> 
>>> 
>>> 
>>> Hi,
>>> 
>>> On 10.03.2011 at 16:50, Edward Lauzier wrote:
>>> 
>>>> I'm looking for best practices and techniques to detect blackhole hosts 
>>>> quickly and disable them. (Platform LSF already has this built in...)
>>>> 
>>>> What I see is possible is:
>>>> 
>>>> Using a cron job on a ge client node...
>>>> 
>>>> -  tail -n 1000 <qmaster_messages_file> | egrep '<for_desired_string>'
>>>> -  if detected, use qmod -d '<queue_instance>' to disable
>>>> -  send email to ge_admin list
>>>> -  possibly send email of failed jobs to user(s)
>>>> 
>>>> Must be robust enough to time out properly when GE is down or too busy 
>>>> for qmod to respond...and/or when there are filesystem problems, etc...
>>>> 
>>>> ( perl or php alarm and sig handlers for proc_open work well for enforcing 
>>>> timeouts...)
>>>> 
>>>> Any hints would be appreciated before I start on it...
>>>> 
>>>> Won't take long to write the code, just looking for best practices and 
>>>> maybe
>>>> a setting I'm missing in the ge config...
>>> 
>>> What is causing the blackhole? For example: if it's a full file system on 
>>> a node, you could detect it with a load sensor in SGE and define an alarm 
>>> threshold in the queue setup, so that no more jobs are scheduled to this 
>>> particular node.
>>> 
>>> -- Reuti
>>> 
>> 
>> 
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>> 
> 


