Hi,
Just add a complex attribute called "slots" to those hosts and set it
to 16. That tells SGE the host has 16 slots total, no matter how many
queues include it.
E.g.
qconf -me node9
(then add "slots=16" on the complex_values line)
or
qconf -aattr exechost complex_values slots=16 node9
Either way, check the result with
qconf -se node9
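Since both node9 and node10 sit in the overlapping queues, the same change is probably wanted on each of them. Here is a rough sketch of doing that in a loop. The cap_host_slots function and the DRY_RUN variable are names I made up for this example, not part of SGE; by default the script only prints the qconf commands so you can review them first.

```shell
#!/bin/sh
# Sketch: cap nodes 9 and 10 at 16 slots each via the exec host
# complex_values line. DRY_RUN and cap_host_slots are hypothetical
# names for this example only.
DRY_RUN="${DRY_RUN:-1}"

cap_host_slots() {
    host="$1"
    slots="$2"
    if [ "$DRY_RUN" = "1" ]; then
        # Only show what would be run.
        echo "qconf -aattr exechost complex_values slots=$slots $host"
    else
        # Actually modify the exec host configuration.
        qconf -aattr exechost complex_values "slots=$slots" "$host"
    fi
}

for host in node9 node10; do
    cap_host_slots "$host" 16
done
```

Run it with DRY_RUN=0 once the printed commands look right, then confirm each host with qconf -se.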
Darn, Reuti beat me to it! :)
Regards,
Alex
On 06/05/2012 11:23 AM, Andrew Pearson wrote:
Hi all
I'm having an oversubscription problem on my cluster. I'll describe the
problem and my proposed solution. I can't implement my solution yet
since there are some several-day jobs running right now, so I thought
I'd run it past everyone on the mailing list.
My problem is simple - parallel jobs submitted to the cluster are using
processor cores that are already occupied by a previously submitted
batch job. It's not clear that the parallel/batch distinction is
important, but I've been running multiple simultaneous parallel jobs for
a while on my current configuration and this problem has never come up
before.
My solution assumes that in fact the problem has nothing to do with
parallel/batch. Rather, it is happening because I have two overlapping
queues: all.q that uses nodes 0 through 10, and all_small.q that uses
nodes 9 and 10. The parallel job runs in all.q, while the batch job
runs in all_small.q. Since both queues have slots=16 (16 processors per
node), nodes 9 and 10 effectively have 32 slots each. If this is
true (that's the question), then all I have to do is change my queues so
that they don't overlap. The fact that the problem doesn't come up with
multiple parallel jobs may be because of load thresholds.
What do you think of my solution? If it's nonsense, can anyone suggest
what the problem may be?
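The slot arithmetic in the question can be sketched roughly as follows. This is only an illustration of the counting, not something that queries SGE; effective_slots is a hypothetical helper, and it assumes a host-level "slots" value acts as a cap on the sum over all queue instances on that host.

```shell
#!/bin/sh
# Illustration only: no SGE commands are run here.
# effective_slots is a hypothetical helper.
#   $1        host-level slots cap (0 means no cap is configured)
#   $2 ...    slots granted by each queue that includes the host
effective_slots() {
    cap="$1"; shift
    total=0
    for s in "$@"; do
        total=$((total + s))
    done
    # Assumption: a host-level slots value limits the total.
    if [ "$cap" -gt 0 ] && [ "$total" -gt "$cap" ]; then
        total="$cap"
    fi
    echo "$total"
}

# node9 is in all.q (16 slots) and all_small.q (16 slots).
echo "no host limit:     $(effective_slots 0 16 16)"
echo "host slots=16 set: $(effective_slots 16 16 16)"
```

Without a host limit the two queues add up to 32 schedulable slots on a 16-core node, which matches the oversubscription described above.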
--
Alex Chekholko [email protected]
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users