On 10.10.2011 at 20:46, Gerald Ragghianti wrote:

> We have a cluster consisting of 48-core compute nodes where we need to run 
> parallel (MPI) jobs across nodes.  There is a hardware limitation on the QDR 
> Infiniband cards that limits the available hardware contexts to 16 per card.  
> We have to ensure that we don't over-subscribe these hardware contexts 
> because parallel jobs without available contexts will crash.  The difficulty 
> is that the contexts needed for a job are a function of the number of compute 
> nodes the job uses, not the number of job slots.

If I understand you correctly, you are looking for something like a complex 
with "consumable HOST" (instead of JOB or YES), i.e. one that is consumed once 
on each exechost used, independent of the total number of slots granted on 
that machine. Unfortunately, this has been discussed before but is not 
implemented yet.
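
To make the difference concrete, here is a minimal Python sketch of the three 
accounting modes. It is only an illustration, not Grid Engine code: the 
function `debit`, the host names and the slot counts are made up, "HOST" is 
the hypothetical mode that does not exist, and "JOB" is modelled as a single 
debit on the master host, which is my understanding of how it behaves.

```python
def debit(mode, request, allocation, master=None):
    """Return {host: amount debited} for one job.

    mode       -- "YES", "JOB", or the hypothetical "HOST"
    request    -- amount requested by the job (e.g. 16 contexts)
    allocation -- {host: slots granted to the job on that host}
    master     -- host of the master task (assumed: first host if omitted)
    """
    hosts = [h for h, slots in allocation.items() if slots > 0]
    if not hosts:
        return {}
    if master is None:
        master = hosts[0]

    if mode == "YES":
        # Per-slot consumable: request is multiplied by the granted slots.
        return {h: request * allocation[h] for h in hosts}
    if mode == "JOB":
        # Per-job consumable: debited once, modelled here on the master host.
        return {master: request}
    if mode == "HOST":
        # Hypothetical per-host consumable: debited once on every used host,
        # regardless of how many slots the job got there.
        return {h: request for h in hosts}
    raise ValueError("unknown consumable mode: %s" % mode)


# A job spanning two 48-core nodes, asking for all 16 contexts.
allocation = {"node1": 48, "node2": 24}
for mode in ("YES", "JOB", "HOST"):
    print(mode, debit(mode, 16, allocation))
```

Only the "HOST" case debits 16 contexts on each node the job touches, which is 
what the hardware limit calls for.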


> We don't want to make each node dedicated to a single job because we also 
> want to be able to run smaller multi-threaded and single-slot jobs. If we 
> assume (for now) that we allow each parallel job to use all 16 contexts on 
> each compute node, how can we ensure that no other parallel jobs will be 
> allocated to these nodes?


You mean each job may consume 1 or all 16 contexts on an exechost? How do you 
decide which case to use?

-- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
