Hi,

On 12.01.2012, at 22:07, Brendan Moloney wrote:

> Hello,
> 
>>> {
>>>  name         shortlimit
>>>  description  NONE
>>>  enabled      TRUE
>>>  limit        queues short.q hosts * to slots=32
> 
>> I think you can leave the "hosts *" out here and in the other RQS below. It
>> means "used slots across all machines" limited to 32 in this queue. The same
>> can be achieved by specifying only the queue.
> 
> Yes, I ended up making some things overly explicit while trying to debug the 
> issue.
> 
>>> }
>>> {
>>>  name         longlimit
>>>  description  NONE
>>>  enabled      TRUE
>>>  limit        queues long.q hosts * to slots=16
>>> }
>>> {
>>>  name         verylonglimit
>>>  description  NONE
>>>  enabled      TRUE
>>>  limit        queues verylong.q hosts * to slots=4
>>> }
>>> {
>>>  name         urgentlimit
>>>  description  NONE
>>>  enabled      TRUE
>>>  limit        users {*} queues urgent.q hosts * to slots=1
>>> }
>>> {
>>>  name         debuglimit
>>>  description  NONE
>>>  enabled      TRUE
>>>  limit        users {*} queues debug.q hosts {*} to slots=1
>>> }
> 
>> As the above 5 limits are disjoint, they can also be put into one and the same
>> RQS. You can give each rule a name to get it listed by name instead of by the
>> number of the rule, which is always 1 right now.
> 
> I originally had these as one RQS, but again tried to make things more 
> explicit (or at least easier for me to understand) while debugging.
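
Just as a sketch of the two points above (dropping the "hosts *" parts and
merging the rules into one RQS), with the slot counts taken from your example
and the rule names chosen arbitrarily so that qquota lists them by name:

{
   name         alllimits
   description  Combined per-queue slot limits
   enabled      TRUE
   limit        name short    queues short.q to slots=32
   limit        name long     queues long.q to slots=16
   limit        name verylong queues verylong.q to slots=4
   limit        name urgent   users {*} queues urgent.q to slots=1
   limit        name debug    users {*} queues debug.q hosts {*} to slots=1
}
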
> 
>>> This will cause a parallel job across multiple queues to never schedule. If
>>> I get rid of the "nodelimit" and instead set the number of slots using
>>> the complex value in the host configuration, then everything works (except
>>> my debug queue).
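
(Side note: setting the slot count in the host configuration would be done
roughly like this; the hostname is only an example taken from further below:

  qconf -mattr exechost complex_values slots=24 animal.ohsu.edu
)
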
> 
>> Do you have many machine types? What happens if you don't use $num_proc
>> there, but specify a hard-coded limit per hostgroup for each machine type
>> instead?
>> 
>> limit        queues !debug.q hosts {@quadcore} to slots=4
>> limit        queues !debug.q hosts {@hexacore} to slots=6
> 
> I don't have many machine types; in fact, I don't have many machines! I tried
> to replace the nodelimit RQS with:
> 
> {
>   name         nodelimit
>   description  NONE
>   enabled      TRUE
>   limit        queues !debug.q hosts {animal.ohsu.edu,kermit.ohsu.edu} to slots=24
>   limit        queues !debug.q hosts {piggy.ohsu.edu} to slots=8
> }
> 
> This gives the same result as the original nodelimit RQS that used $num_proc 
> (the job never gets scheduled).
> 
>>> Below I give an example of a hanging job (with the scheduler output 
>>> enabled).
>>> I set h_rt to 3:50:00 as this will allow the queues short.q, long.q, and
>>> verylong.q. I request 40 slots as that will have to span multiple queues.
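
(Such a request would look roughly like the following; "mpi" and "job.sh" are
only placeholders for whatever PE and job script are actually used:

  qsub -pe mpi 40 -l h_rt=3:50:00 job.sh
)
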
> 
>> If I understand you correctly, SGE could find different combinations for the
>> slot allocation, depending on the algorithm that is used, since all the queues
>> are on the same machines?
> 
> All the queues are on the same machines. I am not sure which "algorithm" you 
> refer to.

I am referring to SGE's internal algorithm for collecting slots from various
queues.

> As mentioned, the scheduler sorts by sequence number, so the queues are
> checked in shortest-to-longest order.

Not for parallel jobs. Only the allocation_rule is used (except for $pe_slots).

http://blogs.oracle.com/sgrell/entry/grid_engine_scheduler_hacks_least

Does your observation match the aspects of parallel jobs described at the end
of the above link?
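
As a quick check (the PE name "mpi" is only a placeholder), the allocation
rule of the PE in question can be displayed with:

  qconf -sp mpi | grep allocation_rule

which prints e.g. "allocation_rule    $fill_up". Possible values are $fill_up,
$round_robin, $pe_slots or a fixed number of slots per host; with $pe_slots
all slots of a job have to come from a single host.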

> Thus my job that requests 40 slots with the given h_rt value should take 32 
> slots from short.q and 8 slots from long.q (provided nothing else is running 
> on the cluster, which is the case for my testing).

Interesting. Collecting slots from different queues has some implications 
anyway:

- the name of $TMPDIR depends on the name of the queue, hence it's not the
same on all nodes (see the illustration below)
- `qrsh -inherit ...` can't distinguish between the granted queues:

https://arc.liv.ac.uk/trac/SGE/ticket/813
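
To illustrate the first point (job and task id made up, and assuming the
queues' tmpdir is /tmp): a job split across short.q and long.q would see
something like

  TMPDIR=/tmp/1234.1.short.q   on the nodes granted from short.q
  TMPDIR=/tmp/1234.1.long.q    on the nodes granted from long.q

so a job script must not rely on the path being identical on every node.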

-- Reuti