On 05.06.2012 at 20:23, Andrew Pearson wrote:
> Hi all
>
> I'm having an oversubscription problem on my cluster. I'll describe the
> problem and my proposed solution. I can't implement my solution yet since
> there are some several-day jobs running right now, so I thought I'd run it
> past everyone on the mailing list.
>
> My problem is simple - parallel jobs submitted to the cluster are using
> processor cores that are already occupied by a previously submitted batch
> job. It's not clear that the parallel/batch distinction is important, but
> I've been running multiple simultaneous parallel jobs for a while on my
> current configuration and this problem has never come up before.
>
> My solution assumes that in fact the problem has nothing to do with
> parallel/batch. Rather, it is happening because I have two overlapping
> queues: all.q that uses nodes 0 through 10, and all_small.q that uses nodes
> 9 and 10. The parallel job runs in all.q, while the batch job runs in
> all_small.q. Since both queues have slots=16 (16 processors per node),
> nodes 9 and 10 effectively have 32 slots each. If this is true (that's the
> question), then all I have to do is change my queues so that they don't
> overlap.
Either that, or:
a) define slots=16 under "complex_values" in the exechost definition
(`qconf -me node09`, and likewise for node10)
b) define an RQS (resource quota set): limit hosts {node09,node10} to slots=16
Either one limits the overall slot consumption across all queues residing on
an exechost.
-- Reuti
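
For illustration, option b) could look roughly like the resource quota set
below. This is only a sketch and has not been tested on your cluster: the rule
name and description are made up, and node09/node10 are assumed to be the
hostnames as Grid Engine knows them. The braces around the host list make the
limit apply to each listed host separately rather than to both combined.

```
{
   name         slots_per_host
   description  "cap total slots per host across all queues"
   enabled      TRUE
   limit        hosts {node09,node10} to slots=16
}
```

A rule set like this can be added with `qconf -arqs` and inspected afterwards
with `qconf -srqs`. For option a), the corresponding line in the exechost
definition would read `complex_values slots=16`.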
> The fact that the problem doesn't come up with multiple parallel jobs may be
> because of load thresholds.
>
> What do you think of my solution? If it's nonsense, can anyone suggest what
> the problem may be?
>
> Thank you.
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users