I seem to have found a combination of resource quotas that is preventing the scheduler from scheduling parallel jobs across multiple queues.
I have multiple queues for jobs with different run times: veryshort.q, short.q, long.q, and verylong.q. Each of these queues has an increasing 'h_rt' limit and an increasing sequence number (I have the scheduler sort by sequence numbers). Each of these queues also has a decreasing number of slots available. Jobs are then submitted with an h_rt value and the shortest queue with an open slot is used. I also have a parallel environment "mpi" that is enabled in all of these queues.

The problem only occurs if I use resource quota sets to both limit the total number of slots for the queues and limit the number of slots on each node. For example:

{
   name         nodelimit
   description  NONE
   enabled      TRUE
   limit        queues !debug.q hosts {*} to slots=$num_proc
}
{
   name         shortlimit
   description  NONE
   enabled      TRUE
   limit        queues short.q hosts * to slots=32
}
{
   name         longlimit
   description  NONE
   enabled      TRUE
   limit        queues long.q hosts * to slots=16
}
{
   name         verylonglimit
   description  NONE
   enabled      TRUE
   limit        queues verylong.q hosts * to slots=4
}
{
   name         urgentlimit
   description  NONE
   enabled      TRUE
   limit        users {*} queues urgent.q hosts * to slots=1
}
{
   name         debuglimit
   description  NONE
   enabled      TRUE
   limit        users {*} queues debug.q hosts {*} to slots=1
}

With these rules, a parallel job that has to span multiple queues is never scheduled. If I get rid of the "nodelimit" rule and instead set the number of slots using the complex value in the host configuration, then everything works (except my debug queue).

Below I give an example of a hanging job (with the scheduler output enabled). I set h_rt to 3:50:00 as this will allow the queues short.q, long.q, and verylong.q. I request 40 slots as that will have to span multiple queues.
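As a quick sanity check on why a 40-slot request has to span queues, here is a minimal Python sketch. The per-queue caps are copied from the quota sets above. The variable names are made up for illustration, and the sketch assumes that h_rt=3:50:00 rules out veryshort.q. It ignores the per-host "nodelimit" rule and any slots already in use, so it only shows that the request is feasible on paper, not what the scheduler actually does.

```python
# Slot caps from the shortlimit, longlimit and verylonglimit quota sets.
# veryshort.q is left out on the assumption that its h_rt limit is below 3:50:00.
queue_slot_caps = {
    "short.q": 32,
    "long.q": 16,
    "verylong.q": 4,
}

requested_slots = 40  # from "-pe mpi 40"

# Queues that could hold the whole job on their own (none are expected).
single_queue_fits = [q for q, cap in queue_slot_caps.items() if cap >= requested_slots]

# Combined cap across all eligible queues.
total_cap = sum(queue_slot_caps.values())

print("queues that fit the job alone:", single_queue_fits)
print("combined cap:", total_cap, "requested:", requested_slots)
```

No single queue can hold the job, but the combined cap of 52 slots is larger than the 40 requested, so the queue-level limits alone should not prevent the job from running.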
$ qsub -w e -l h_rt=3:50:00 -pe mpi 40 test.sh
Your job 13280 ("test.sh") has been submitted

$ qstat -u '*'
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  13280 0.00000 test.sh    moloney      qw    01/11/2012 21:21:32                                   40

$ qstat -j 13280
==============================================================
job_number:                 13280
exec_file:                  job_scripts/13280
submission_time:            Wed Jan 11 21:21:32 2012
owner:                      moloney
...
scheduling info:            cannot run in queue "debug.q" because PE "mpi" is not in pe list
                            cannot run in queue "urgent.q" because PE "mpi" is not in pe list
                            cannot run because it exceeds limit "////piggy/" in rule "nodelimit/1"
                            cannot run because it exceeds limit "////piggy/" in rule "nodelimit/1"
                            cannot run because it exceeds limit "////piggy/" in rule "nodelimit/1"
                            cannot run because it exceeds limit "////piggy/" in rule "nodelimit/1"
                            cannot run because it exceeds limit "////kermit/" in rule "nodelimit/1"
                            cannot run because it exceeds limit "////kermit/" in rule "nodelimit/1"
                            cannot run because it exceeds limit "////kermit/" in rule "nodelimit/1"
                            cannot run because it exceeds limit "////kermit/" in rule "nodelimit/1"
                            cannot run because it exceeds limit "////animal/" in rule "nodelimit/1"
                            cannot run because it exceeds limit "////animal/" in rule "nodelimit/1"
                            cannot run because it exceeds limit "////animal/" in rule "nodelimit/1"
                            cannot run because it exceeds limit "////animal/" in rule "nodelimit/1"
                            cannot run in PE "mpi" because it only offers 0 slots