Hi all, I'm having an oversubscription problem on my cluster. I'll describe the problem and my proposed solution. I can't try the solution yet because some multi-day jobs are running right now, so I thought I'd run it past everyone on the mailing list first.
My problem is simple: parallel jobs submitted to the cluster are using processor cores that are already occupied by a previously submitted batch job. It's not clear that the parallel/batch distinction is important, but I've been running multiple simultaneous parallel jobs for a while on my current configuration and this problem has never come up before.

My solution assumes that the problem in fact has nothing to do with parallel vs. batch. Rather, I think it happens because I have two overlapping queues: all.q, which uses nodes 0 through 10, and all_small.q, which uses nodes 9 and 10. The parallel job runs in all.q, while the batch job runs in all_small.q. Since both queues have slots=16 (16 processors per node), nodes 9 and 10 effectively have 32 slots each. If this is true (that's the question), then all I have to do is change my queues so that they don't overlap. The fact that the problem doesn't come up with multiple parallel jobs may be because of load thresholds.

What do you think of my solution? If it's nonsense, can anyone suggest what the problem may be? Thank you.
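
P.S. To make the slot arithmetic concrete, here is a small Python sketch of how I think the slots add up. It only models my guess that slots from overlapping queues add per host; it does not query Grid Engine at all. The host names (node0 through node10) and the helper function names are made up for illustration. The queue names, node ranges, and slots=16 are from my setup as described above.

```python
# Sketch: sum per-host slots across queues and flag hosts where the total
# exceeds the core count. This models my assumption about overlapping queues;
# it does not talk to Grid Engine.
from collections import defaultdict

CORES_PER_NODE = 16  # 16 processors per node

# Hypothetical host names; queue layout as described in this post.
queues = {
    "all.q": {"hosts": [f"node{i}" for i in range(0, 11)], "slots": 16},
    "all_small.q": {"hosts": ["node9", "node10"], "slots": 16},
}


def slots_per_host(queues):
    """Return {host: total slots offered by all queues on that host}."""
    totals = defaultdict(int)
    for queue in queues.values():
        for host in queue["hosts"]:
            totals[host] += queue["slots"]
    return dict(totals)


def oversubscribed_hosts(queues, cores):
    """Return only the hosts whose combined slots exceed the core count."""
    return {h: s for h, s in slots_per_host(queues).items() if s > cores}


if __name__ == "__main__":
    for host, slots in sorted(oversubscribed_hosts(queues, CORES_PER_NODE).items()):
        print(f"{host}: {slots} slots for {CORES_PER_NODE} cores")
```

If my assumption holds, only nodes 9 and 10 show up as oversubscribed, which matches where I see the parallel and batch jobs colliding.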
