It definitely needs to be added to Grid Engine as a built-in feature, and we'll do that sooner rather than later, but it would be a nice little project for community members who have not contributed code so far (hint! ;-)).

Rayson is correct that you'll need a heuristic, exit-rate-based blackhole detection scheme, because there are dozens of reasons why a blackhole might emerge and you can't foresee them all. So even if you check for a few known blackhole causes, you will still want exit-rate detection as a fall-back, just to prevent losing too many jobs.

For issues you know might cause a blackhole, you can use a mix of things; Reuti has mentioned load sensors. Those prevent *any* loss of jobs from *known* blackhole conditions.

Cheers,

Fritz

On 10.03.11 19:01, Rayson Ho wrote:
Reuti,

I don't understand what you mean by too late... If you know for sure
the disk WILL cause problems, then of course it is easy. But the
problem is that the load sensor does not necessarily know what to check
and what will fail next, so you might need to check every disk, NFS
mount, network connection, software license, etc to come up with
"host_healthcheck".

In LSF, the admin can define EXIT_RATE for a host and
GLOBAL_EXIT_RATE for the whole cluster. In SGE this can only be done
in the starter_method, as it knows when jobs are started and when
jobs exit. A simple one would write to some sort of /tmp area and do
some math to come up with the rate. When the exit rate exceeds the
EXIT_RATE threshold, it closes the queue/host.
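
A minimal sketch of the "write to /tmp and do some math" idea, in Python.
The function name, the state file layout (one timestamp per line) and the
default thresholds are all assumptions for illustration, not anything SGE
or LSF defines. It does no file locking, so jobs finishing at the same
moment could race; a real starter_method wrapper would need to handle that.

```python
import os


def record_exit_and_check(state_file, now, window_seconds=60.0, max_exits=10):
    """Record one job exit and report whether the host looks like a blackhole.

    Hypothetical helper: state_file holds one exit timestamp per line.
    Returns True when more than max_exits exits fall inside the window.
    """
    stamps = []
    if os.path.exists(state_file):
        with open(state_file) as fh:
            stamps = [float(line) for line in fh if line.strip()]

    stamps.append(now)

    # Keep only exits that happened within the sliding window.
    recent = [t for t in stamps if now - t <= window_seconds]

    with open(state_file, "w") as fh:
        fh.writelines(f"{t}\n" for t in recent)

    return len(recent) > max_exits
```

A wrapper would call this with time.time() after each job exits and, on a
True result, disable the queue instance (for example with qmod -d).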

Rayson


On Thu, Mar 10, 2011 at 12:53 PM, Reuti <[email protected]> wrote:
The starter method should be really simple, just record the exit time
of the last few jobs, and calculate the rate of exit. If the rate is
too high, disable the host.

http://gridscheduler.sourceforge.net/htmlman/htmlman5/queue_conf.html

"starter_method"

Isn't it already too late when the "starter_method" is started? I mean, when no 
job information can be written (e.g. to the spool area), it will never get executed but 
still trash the job.

-- Reuti


Rayson



On Thu, Mar 10, 2011 at 11:47 AM, Reuti <[email protected]> wrote:
well, the feature to use the hawking radiation to allow the jobs to pop up on 
other nodes needs precise alignment of the installation -  SCNR

There is a demo script to check the size of e.g. /tmp here 
http://arc.liv.ac.uk/SGE/howto/loadsensor.html and then use "load_thresholds 
tmpfree=1G" in the queue definition, so that the queue instance is set to alarm 
state in case it falls below a certain value.

A load sensor can also deliver a boolean value, so checking something locally, such as
"all disks fine", and using it as a "load_threshold" can also be a solution. How to
check it is of course specific to your node setup.
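
For illustration, a free-space load sensor along these lines could be
sketched in Python as below. It follows the usual load sensor loop: wait
for a line on stdin, stop on "quit", otherwise print one report framed by
"begin" and "end". The function names are made up, "tmpfree" is the
complex name from the queue setting above, and the "M" suffix and the use
of socket.gethostname() are assumptions; the host name must match what
Grid Engine uses for the node, and the complex has to be defined first.

```python
import shutil
import socket
import sys


def tmp_free_report(host, path="/tmp"):
    """Build one load report: begin, host:complex:value, end."""
    free_mb = shutil.disk_usage(path).free // (1024 * 1024)
    # Assumption: "tmpfree" is a memory-type complex, so "M" is accepted.
    return ["begin", f"{host}:tmpfree:{free_mb}M", "end"]


def run_sensor(stdin=sys.stdin, stdout=sys.stdout, host=None, path="/tmp"):
    """Answer each request line from the execd until "quit" arrives."""
    host = host or socket.gethostname()
    for line in stdin:
        if line.strip() == "quit":
            break
        for out in tmp_free_report(host, path):
            stdout.write(out + "\n")
        # Flush so the report is not stuck in a buffer.
        stdout.flush()
```

A script used as the actual load sensor would end with a call to
run_sensor(). A boolean "all disks fine" check would use the same loop and
report a 0/1 value instead of a size.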

The last necessary piece would be to inform the admin: this could be done by 
the load sensor too, but as the node is known not to be in a proper state I 
wouldn't recommend this. Better might be a cron-job on the qmaster machine 
checking `qstat -explain a -qs a -u foobar` *)  to look for exceeded load 
thresholds.

-- Reuti

*) There is no switch "show no jobs at all" to `qstat`, so using an unknown user "foobar" 
will help. And OTOH there is no "load_threshold" in the exechost definition.


-Ed



Hi,

On 10.03.2011 at 16:50, Edward Lauzier wrote:

I'm looking for best practices and techniques to detect blackhole hosts quickly
and disable them.  ( Platform LSF has this already built in...)

What I see as possible is:

Using a cron job on a ge client node...

-  tail -n 1000 <qmaster_messages_file>  | egrep '<for_desired_string>'
-  if detected, use qmod -d '<queue_instance>' to disable
-  send email to ge_admin list
-  possibly send email of failed jobs to user(s)

Must be robust enough to time out properly when ge is down or too busy
for qmod to respond...and/or when there are filesystem problems, etc...

( perl or php alarm and sig handlers for proc_open work well for enforcing 
timeouts...)
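
The steps above, with a timeout around the external command, could be
sketched in Python roughly as follows. The helper names are hypothetical,
and the pattern is left as a parameter because the exact message to look
for depends on the failure. Only `qmod -d <queue_instance>` is taken from
the list above; sending mail is left out.

```python
import re
import subprocess
from collections import deque


def matching_tail_lines(path, pattern, max_lines=1000):
    """Return lines among the last max_lines of a file that match pattern."""
    regex = re.compile(pattern)
    with open(path, errors="replace") as fh:
        tail = deque(fh, maxlen=max_lines)  # keeps only the last lines
    return [line.rstrip("\n") for line in tail if regex.search(line)]


def run_with_timeout(argv, timeout_seconds=30):
    """Run a command; return its exit code, or None if it timed out."""
    try:
        done = subprocess.run(
            argv,
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this.
        return None
    return done.returncode


def disable_queue_instance(queue_instance, timeout_seconds=30):
    """Disable a queue instance via qmod -d, giving up after the timeout."""
    return run_with_timeout(["qmod", "-d", queue_instance], timeout_seconds)
```

A None result from disable_queue_instance() means qmod did not answer in
time, which the cron job can report instead of hanging.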

Any hints would be appreciated before I start on it...

Won't take long to write the code, just looking for best practices and maybe
a setting I'm missing in the ge config...

What is causing the blackhole? For example: if it's a full file system on a 
node, you could detect it with a load sensor in SGE and define an alarm threshold 
in the queue setup, so that no more jobs are scheduled to this particular node.

-- Reuti



_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users




