Hi,

On 12.01.2012 at 22:07, Brendan Moloney wrote:

> Hello,
>
>>> {
>>>    name         shortlimit
>>>    description  NONE
>>>    enabled      TRUE
>>>    limit        queues short.q hosts * to slots=32
>
>> I think you can leave the "hosts *" out here and the other RQS below. It
>> means "used slots across all machines" limited to 32 in this queue. The same
>> can be achieved by specifying only the queue.
>
> Yes, I ended up making some things overly explicit while trying to debug the
> issue.
>
>>> }
>>> {
>>>    name         longlimit
>>>    description  NONE
>>>    enabled      TRUE
>>>    limit        queues long.q hosts * to slots=16
>>> }
>>> {
>>>    name         verylonglimit
>>>    description  NONE
>>>    enabled      TRUE
>>>    limit        queues verylong.q hosts * to slots=4
>>> }
>>> {
>>>    name         urgentlimit
>>>    description  NONE
>>>    enabled      TRUE
>>>    limit        users {*} queues urgent.q hosts * to slots=1
>>> }
>>> {
>>>    name         debuglimit
>>>    description  NONE
>>>    enabled      TRUE
>>>    limit        users {*} queues debug.q hosts {*} to slots=1
>>> }
>
>> As the above 5 limits are disjunct, they can also be put in one and the same
>> RQS. You can give each a name to get it listed instead of the number of the
>> rule, which is always 1 right now.
>
> I originally had these as one RQS, but again tried to make things more
> explicit (or at least easier for me to understand) while debugging.
>
>>> This will cause a parallel job across multiple queues to never schedule. If
>>> I get rid of the "nodelimit" and instead set the number of slots using
>>> the complex value in the host configuration, then everything works (except
>>> my debug queue).
>
>> Do you have many machinetypes? What happens, if you don't use $num_proc
>> there but specify a hard coded limit per hostgroup for a machinetype or so?
>>
>> limit queues !debug.q hosts {@quadcore} to slots=4
>> limit queues !debug.q hosts {@hexacore} to slots=6
>
> I don't have many machine types, in fact I don't have many machines!
> I tried to replace the nodelimit RQS with:
>
> {
>    name         nodelimit
>    description  NONE
>    enabled      TRUE
>    limit        queues !debug.q hosts {animal.ohsu.edu,kermit.ohsu.edu} to slots=24
>    limit        queues !debug.q hosts {piggy.ohsu.edu} to slots=8
> }
>
> This gives the same result as the original nodelimit RQS that used $num_proc
> (the job never gets scheduled).
>
>>> Below I give an example of a hanging job (with the scheduler output
>>> enabled).
>>> I set h_rt to 3:50:00 as this will allow the queues short.q, long.q, and
>>> verylong.q. I request 40 slots as that will have to span multiple queues.
>
>> If I get you right, SGE could find different combinations for the slot
>> allocation, depending on the algorithm which is used as all the queues are
>> on the same machines?
>
> All the queues are on the same machines. I am not sure which "algorithm" you
> refer to.

I mean SGE's internal algorithm for collecting slots from various queues.

> As mentioned, the scheduler sorts by sequence number so the queues are
> checked in shortest to longest order.

Not for parallel jobs. Only the allocation_rule is used (except for $pe_slots).

http://blogs.oracle.com/sgrell/entry/grid_engine_scheduler_hacks_least

Does your observation match the points about parallel jobs at the end of the article linked above?

> Thus my job that requests 40 slots with the given h_rt value should take 32
> slots from short.q and 8 slots from long.q (provided nothing else is running
> on the cluster, which is the case for my testing).

Interesting. Collecting slots from different queues has some implications anyway:

- the name of the $TMPDIR depends on the name of the queue, hence it's not the same on all nodes
- `qrsh -inherit ...` can't distinguish between the granted queues:

  https://arc.liv.ac.uk/trac/SGE/ticket/813

-- Reuti
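
PS: To illustrate the remark above about putting the five disjoint limits into one RQS, a combined rule set might look roughly like the following. This is an untested sketch: the rule set name "queuelimits" and the per-rule names are invented for the example, the redundant "hosts *" is left out, and the optional "name" keyword is used as I understand it from sge_resource_quota(5). Within a single rule set only the first matching rule is applied, which should be fine here because each rule matches a different queue.

```
{
   name         queuelimits
   description  NONE
   enabled      TRUE
   limit        name short    queues short.q to slots=32
   limit        name long     queues long.q to slots=16
   limit        name verylong queues verylong.q to slots=4
   limit        name urgent   users {*} queues urgent.q to slots=1
   limit        name debug    users {*} queues debug.q hosts {*} to slots=1
}
```

With named rules, `qquota` should then report the rule by name (for example "queuelimits/short") rather than by its number.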