We have a cluster of 48-core compute nodes on which we need to run
parallel (MPI) jobs that span multiple nodes. The QDR InfiniBand cards
have a hardware limitation: only 16 hardware contexts are available per
card. We have to ensure that these contexts are never over-subscribed,
because a parallel job that cannot obtain a context will crash. The
difficulty is that the number of contexts a job needs is a function of
the number of compute nodes it uses, not the number of job slots.

We don't want to dedicate each node to a single job, because we also
want to run smaller multi-threaded and single-slot jobs on the same
nodes. If we assume (for now) that each parallel job is allowed to use
all 16 contexts on every compute node it runs on, how can we ensure that
no other parallel job is allocated to those nodes?
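
To make the requirement concrete, the bookkeeping we are after could be sketched roughly as follows. This is only an illustration in Python, not Grid Engine configuration. The names `Node` and `place` are made up for the example, and it assumes (as above) that a parallel job claims all 16 contexts on a node while multi-threaded and single-slot jobs claim none.

```python
from dataclasses import dataclass

CONTEXTS_PER_CARD = 16  # hardware limit per QDR InfiniBand card (from the post)
CORES_PER_NODE = 48     # job slots per compute node (from the post)


@dataclass
class Node:
    """Hypothetical per-node bookkeeping: free job slots and free contexts."""
    name: str
    free_slots: int = CORES_PER_NODE
    free_contexts: int = CONTEXTS_PER_CARD


def place(node, slots, contexts):
    """Reserve slots and contexts on a node; refuse if either is short."""
    if slots > node.free_slots or contexts > node.free_contexts:
        return False
    node.free_slots -= slots
    node.free_contexts -= contexts
    return True


node = Node("node01")

# First parallel job: only 8 slots on this node, but all 16 contexts (assumption).
first_mpi = place(node, slots=8, contexts=CONTEXTS_PER_CARD)

# Second parallel job: 40 slots are still free, but no contexts are left.
second_mpi = place(node, slots=8, contexts=CONTEXTS_PER_CARD)

# Multi-threaded job: needs no InfiniBand contexts, so it still fits.
threaded = place(node, slots=12, contexts=0)

print(first_mpi, second_mpi, threaded, node.free_slots, node.free_contexts)
```

The point of the sketch is that the context count has to be tracked per node and independently of slots: the second parallel job is refused even though plenty of slots remain, while the non-MPI job is still accepted.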
GE version: 6.1u5
--
Gerald Ragghianti