Rayson is correct that you'll need a heuristic, exit-rate based blackhole detection scheme, because there are dozens of reasons why a blackhole might emerge and you can't foresee them all. So even if you check for a few known blackhole causes, you will still want the exit-rate detection as a fall-back, just to prevent losing too many jobs.
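Such an exit-rate fall-back could be sketched like this (all paths and thresholds below are made up and would need tuning per site; in SGE the helpers would be called from a starter_method wrapper, as discussed further down the thread):

```shell
#!/bin/sh
# Sketch of the exit-rate fall-back (hypothetical names, paths and
# thresholds). Intended to be called from a starter_method wrapper that
# runs the real job and then records its exit.

LOG=${LOG:-/tmp/sge_exit_times}   # per-host record of recent job exit times
WINDOW=300                        # look at the last 5 minutes
MAX_EXITS=10                      # more exits than this in WINDOW is suspicious

record_exit() {
    date +%s >> "$LOG"
}

exit_rate_exceeded() {
    [ -f "$LOG" ] || return 1
    now=`date +%s`
    cutoff=`expr $now - $WINDOW`
    # drop timestamps outside the window, then count what is left
    awk -v c="$cutoff" '$1 >= c' "$LOG" > "$LOG.tmp" && mv "$LOG.tmp" "$LOG"
    n=`wc -l < "$LOG" | tr -d ' \t'`
    [ "$n" -gt "$MAX_EXITS" ]
}

# A starter_method wrapper could use the two helpers roughly like this
# (QUEUE and HOSTNAME are set in the job environment by SGE):
#
#   "$@"; rc=$?                      # run the real job
#   record_exit
#   if exit_rate_exceeded; then
#       qmod -d "$QUEUE@$HOSTNAME"   # close this queue instance
#   fi
#   exit $rc
```

Note Reuti's caveat below still applies: when the node is so broken that the starter_method never runs, this wrapper records nothing, so it complements rather than replaces the load-sensor approach.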
For issues you are aware of that might cause a blackhole, you can use a mix of things. Load sensors, as Reuti mentioned, can prevent *any* loss of jobs from *known* blackhole conditions.
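A boolean load sensor for such a known condition might look like this (a sketch: "disks_ok" is a hypothetical complex that would first have to be added via `qconf -mc`, and the health check is only an example to replace with whatever fits your nodes):

```shell
#!/bin/sh
# Sketch of a boolean load sensor (the complex name "disks_ok" is made up
# and must exist as a custom complex before this reports anything useful).

check_disks() {
    # report "1" if every listed directory is writable, else "0"
    ok=1
    for d in /tmp; do
        touch "$d/.ls_probe.$$" 2>/dev/null && rm -f "$d/.ls_probe.$$" || ok=0
    done
    echo $ok
}

report() {
    # one load report in the begin/end format qmaster expects
    echo "begin"
    echo "`hostname`:disks_ok:`check_disks`"
    echo "end"
}

# Load sensor protocol: emit a report each time qmaster writes a line to
# our stdin; terminate on "quit" (or EOF).
while read line; do
    [ "$line" = "quit" ] && break
    report
done
```

With a suitable relational operator on the complex, something like "load_thresholds disks_ok=0" in the queue definition would then set the queue instance to alarm state when the sensor reports 0 (the exact threshold syntax depends on how the complex is defined).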
Cheers,
Fritz

On 10.03.11 19:01, Rayson Ho wrote:
Reuti,

I don't understand what you mean by too late... If you know for sure the disk WILL cause problems, then of course it is easy. But the problem is that the load sensor does not necessarily know what to check and what will fail next, so you might need to check every disk, NFS mount, network connection, software license, etc. to come up with a "host_healthcheck".

In LSF, the admin can define the EXIT_RATE for a host & the GLOBAL_EXIT_RATE for the whole cluster. In SGE this can only be done in the starter_method, as it knows when jobs are started & when they exit. So a simple one would write to some sort of /tmp area and do some math to come up with the rate. When the exit rate exceeds the EXIT_RATE threshold, it would close the queue/host.

Rayson

On Thu, Mar 10, 2011 at 12:53 PM, Reuti <[email protected]> wrote:
>> The starter method should be really simple: just record the exit times of the last few jobs and calculate the rate of exits. If the rate is too high, disable the host.
>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/queue_conf.html "starter_method"
>
> Isn't it already too late when the "starter_method" is started? I mean, when no job information can be written (e.g. to the spool area), it will never get executed but still trash the job.
>
> -- Reuti
>
>> Rayson
>>
>> On Thu, Mar 10, 2011 at 11:47 AM, Reuti <[email protected]> wrote:
>>> well, the feature to use the hawking radiation to allow the jobs to pop up on other nodes needs precise alignment of the installation - SCNR
>>>
>>> There is a demo script to check the size of e.g. /tmp here:
>>> http://arc.liv.ac.uk/SGE/howto/loadsensor.html
>>> Then use "load_thresholds tmpfree=1G" in the queue definition, so that the queue instance is set to alarm state in case free space falls below a certain value.
>>>
>>> A load sensor can also deliver a boolean value, hence checking locally something like "all disks fine" and using this as a "load_threshold" can also be a solution. How to check something is of course specific to your node setup.
>>> The last necessary piece would be to inform the admin: this could be done by the load sensor too, but as the node is known not to be in a proper state I wouldn't recommend this. Better might be a cron job on the qmaster machine checking `qstat -explain a -qs a -u foobar` *) to look for passed load thresholds.
>>>
>>> -- Reuti
>>>
>>> *) There is no "show no jobs at all" switch to `qstat`, so using an unknown user "foobar" will help.
>>>
>>>> And OTOH there is no "load_threshold" in the exechost definition.
>>>>
>>>> -Ed
>>>>
>>>>> Hi,
>>>>>
>>>>> On 10.03.2011 at 16:50, Edward Lauzier wrote:
>>>>>> I'm looking for best practices and techniques to detect blackhole hosts quickly and disable them. (Platform LSF has this already built in...)
>>>>>>
>>>>>> What I see as possible is using a cron job on a GE client node:
>>>>>> - tail -n 1000 -f <qmaster_messages_file> | egrep '<for_desired_string>'
>>>>>> - if detected, use qmod -d '<queue_instance>' to disable
>>>>>> - send email to the ge_admin list
>>>>>> - possibly send email of failed jobs to the user(s)
>>>>>>
>>>>>> It must be robust enough to time out properly when GE is down or too busy for qmod to respond, and/or on filesystem problems, etc. (perl or php alarm and signal handlers for proc_open work well for enforcing timeouts...)
>>>>>>
>>>>>> Any hints would be appreciated before I start on it... It won't take long to write the code; I'm just looking for best practices and maybe a setting I'm missing in the GE config...
>>>>>
>>>>> What is causing the blackhole? For example: if it's a full file system on a node, you could detect it with a load sensor in SGE and define an alarm threshold in the queue setup, so that no more jobs are scheduled to this particular node.
>>>>>
>>>>> -- Reuti
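Reuti's cron-job idea from the quoted thread could be sketched as follows (the address and path in the crontab sample are hypothetical; cron itself mails any non-empty output to MAILTO, which also covers informing the admin):

```shell
#!/bin/sh
# Sketch of a qmaster-side cron job (hypothetical crontab address/path).
# `qstat -explain a -qs a -u foobar` shows queue instances in alarm state
# together with the reason; the unknown user "foobar" suppresses the job
# listing.

alarm_report() {
    # print the command's output minus blank lines; any output at all
    # means some queue instance has passed a load threshold
    "$@" 2>&1 | grep -v '^[[:space:]]*$' || true
}

# Sample crontab entry (cron mails non-empty output to MAILTO by itself):
#   MAILTO=ge-admin@example.com
#   */10 * * * * /usr/local/sbin/check_alarms.sh
if command -v qstat >/dev/null 2>&1; then
    alarm_report qstat -explain a -qs a -u foobar
fi
```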
_______________________________________________ users mailing list [email protected] https://gridengine.org/mailman/listinfo/users
