On 05.06.2012 at 20:23, Andrew Pearson wrote:

> Hi all
> 
> I'm having an oversubscription problem on my cluster.  I'll describe the 
> problem and my proposed solution.  I can't implement my solution yet since 
> there are some several-day jobs running right now, so I thought I'd run it 
> past everyone on the mailing list.
> 
> My problem is simple - parallel jobs submitted to the cluster are using 
> processor cores that are already occupied by a previously submitted batch 
> job.  It's not clear that the parallel/batch distinction is important, but 
> I've been running multiple simultaneous parallel jobs for a while on my 
> current configuration and this problem has never come up before.
> 
> My solution assumes that in fact the problem has nothing to do with 
> parallel/batch.  Rather, it is happening because I have two overlapping 
> queues:  all.q that uses nodes 0 through 10, and all_small.q that uses nodes 
> 9 and 10.  The parallel job runs in all.q, while the batch job runs in 
> all_small.q.  Since both queues have slots=16 (16 processors per node), 
> nodes 9 and 10 effectively have 32 slots each.  If this is true (that's the 
> question), then all I have to do is change my queues so that they don't 
> overlap.

Either this, or:

a) define slots=16 under "complex_values" in the exec host definition 
(`qconf -me node09`, and likewise for node10)

b) define a resource quota set (RQS): limit hosts {node09,node10} to slots=16

to limit the overall consumption across all queues residing on an exechost.
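
As a rough sketch of what the two options could look like (the host names and the limit of 16 are taken from the setup described above; the rule set name `slots_per_host` and its description text are made up for illustration): for a), the editor opened by `qconf -me node09` would contain a line like the following, and the same change would be repeated for node10.

```
complex_values        slots=16
```

For b), a rule set along these lines could be added with `qconf -arqs slots_per_host` and checked afterwards with `qconf -srqs`:

```
{
   name         slots_per_host
   description  "Cap the slots used on a host across all queues"
   enabled      TRUE
   limit        hosts {node09,node10} to slots=16
}
```

The curly braces around the host list mean the limit is applied to each listed host separately, not to the sum over both hosts.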

-- Reuti


> The fact that the problem doesn't come up with multiple parallel jobs may be 
> because of load thresholds.
> 
> What do you think of my solution?  If it's nonsense, can anyone suggest what 
> the problem may be?
> 
> Thank you.
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

