We have a cluster of 48-core compute nodes on which we need to run parallel (MPI) jobs that span multiple nodes. The QDR InfiniBand cards have a hardware limitation: only 16 hardware contexts are available per card. We must not over-subscribe these contexts, because parallel jobs that find no available contexts will crash. The difficulty is that the number of contexts a job needs is a function of the number of compute nodes it uses, not the number of job slots.

We don't want to dedicate each node to a single job, because we also want to be able to run smaller multi-threaded and single-slot jobs. If we assume (for now) that each parallel job may use all 16 contexts on every compute node it runs on, how can we ensure that no other parallel job is allocated to those nodes?
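
To make the constraint concrete, here is a minimal sketch in Python of the bookkeeping we would like the scheduler to do. It is only a model of the problem, not Grid Engine configuration: the function names (can_place, place) and the host names are hypothetical, and the only figure taken from our setup is the limit of 16 contexts per node. It assumes, as above, that a parallel job claims all 16 contexts on each node it touches.

```python
CONTEXTS_PER_NODE = 16  # hardware contexts per QDR InfiniBand card


def can_place(job_nodes, contexts_in_use, contexts_per_job=CONTEXTS_PER_NODE):
    """Return True if the job fits within the context limit on every node."""
    return all(
        contexts_in_use.get(node, 0) + contexts_per_job <= CONTEXTS_PER_NODE
        for node in job_nodes
    )


def place(job_nodes, contexts_in_use, contexts_per_job=CONTEXTS_PER_NODE):
    """Reserve contexts on every node of the job, or reserve nothing at all."""
    if not can_place(job_nodes, contexts_in_use, contexts_per_job):
        return False
    for node in job_nodes:
        contexts_in_use[node] = contexts_in_use.get(node, 0) + contexts_per_job
    return True


# Hypothetical host names, for illustration only.
in_use = {}
first = place(["node01", "node02"], in_use)   # claims 16 contexts on each node
second = place(["node02", "node03"], in_use)  # node02 is already full
print(first, second)
```

Note that the check is per node, not per slot: the second job is rejected because of node02 alone, no matter how many of its 48 slots are still free for multi-threaded or single-slot jobs.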

GE version: 6.1u5

--
Gerald Ragghianti


