On Thu, 10 Mar 2011, Rayson Ho wrote: ...
In LSF, the admin can define EXIT_RATE for a host and GLOBAL_EXIT_RATE for the whole cluster. In SGE, this can only be done in the starter_method, as it knows when jobs start and when they exit. A simple version would write to some sort of /tmp area and do some math to come up with the rate. When the rate exceeds the EXIT_RATE threshold, it closes the queue/host.
...
Good idea. For what it's worth, I would do this in the prolog or epilog rather than in the starter_method, for the following reasons:
1) In the prolog, you should be able to disable the node and then exit with status 99, which causes the job to be rescheduled onto another node.
2) In the epilog, you can disable a node immediately after a job has caused a problem, instead of detecting it at the next job start. This can aid scheduling efficiency if you have a lot of jobs requesting Resource Reservations. You do get the odd sacrificial job if a problem develops spontaneously, though.
3) Writing a starter_method that copes with all the different ways that tightly-integrated MPI implementations start can be fiddly. [If anyone's interested, sticking an eval "$@" at the bottom of the script solves most of them.]
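To make point 1 concrete, here is a rough sketch of such a prolog. Treat it as an illustration under assumptions rather than what we actually run: the health check (a writable scratch area), the HEALTH_SCRATCH and QMOD variables and the function names are all made up for the example, and it assumes the prolog runs as a user who is allowed to run qmod on the execution host and that `uname -n` matches the host name Grid Engine uses for its queue instances.

```shell
#!/bin/sh
# Prolog sketch: check the node; on failure disable it and reschedule the job.

# Hypothetical health check: succeed only if the scratch area is writable.
node_healthy() {
    scratch=${HEALTH_SCRATCH:-/tmp}
    probe="$scratch/.prolog_probe.$$"
    ( : > "$probe" ) 2>/dev/null && rm -f "$probe"
}

# Disable every queue instance on this host.
# QMOD is a hypothetical override so the sketch can be tried without a cluster.
disable_node() {
    ${QMOD:-qmod} -d "*@$(uname -n)"
}

# Return 0 if the node looks fine, otherwise disable it and return 99.
prolog_main() {
    if node_healthy; then
        return 0
    fi
    disable_node
    return 99    # exit status 99 asks Grid Engine to reschedule the job
}

# JOB_ID is set by Grid Engine for the prolog; without it, only define
# the functions (handy for trying them out by hand).
if [ -n "${JOB_ID:-}" ]; then
    prolog_main
    exit $?
fi
```

Running it by hand with QMOD=echo and HEALTH_SCRATCH pointing at a missing directory shows the qmod command it would issue, without touching any queue.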
On our GE 6.0 cluster, we put in a check/disable at job end. I have lost count of the number of times this has saved queued jobs from vanishing down a plughole over the past 5+ years. Bizarrely, it also gave the users the impression that the cluster was very stable, despite it having relatively flaky Myrinet 2000 hardware/drivers on it :)
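For the rate arithmetic Rayson describes (write to a /tmp area, then do some math), something along these lines might do. Again, this is only a sketch with invented names: EXIT_LOG, WINDOW, THRESHOLD, QMOD and the functions are assumptions for illustration, not part of Grid Engine or LSF, and it deliberately takes the job's exit status as an argument because how you obtain it depends on which script calls it. It also relies on `date +%s`, which GNU and BSD date support but POSIX does not require.

```shell
#!/bin/sh
# Sketch of per-host exit-rate bookkeeping in a /tmp-style area.

EXIT_LOG=${EXIT_LOG:-/tmp/sge_exit_rate.log}   # hypothetical location
WINDOW=${WINDOW:-300}                          # seconds to look back
THRESHOLD=${THRESHOLD:-5}                      # failures tolerated in the window

# Append the current time (seconds since the epoch) for one failed job.
record_failure() {
    date +%s >> "$EXIT_LOG"
}

# Print how many failures were recorded within the last WINDOW seconds.
recent_failures() {
    [ -f "$EXIT_LOG" ] || { echo 0; return; }
    awk -v now="$(date +%s)" -v win="$WINDOW" \
        'now - $1 <= win { n++ } END { print n + 0 }' "$EXIT_LOG"
}

# Succeed (status 0) when the recent failure count exceeds THRESHOLD.
rate_exceeded() {
    [ "$(recent_failures)" -gt "$THRESHOLD" ]
}

# Record one job's outcome and disable this host if the rate is too high.
# QMOD is a hypothetical override so the sketch can be tried without a cluster.
job_finished() {
    status=$1
    [ "$status" -eq 0 ] || record_failure
    if rate_exceeded; then
        ${QMOD:-qmod} -d "*@$(uname -n)"
    fi
}
```

The log is never trimmed here, so a real version would want to rotate or prune it. Each append is a single short line, so concurrent jobs on one host should not normally corrupt each other's entries.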
Mark
-- 
-----------------------------------------------------------------
Mark Dixon                       Email    : [email protected]
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------
