On 11.10.2011 at 14:37, William Hay wrote:

> On 11 October 2011 12:55, Reuti wrote:
>> On 10.10.2011 at 20:46, Gerald Ragghianti wrote:
>>
>>> We have a cluster consisting of 48-core compute nodes where we need to run
>>> parallel (MPI) jobs across nodes. There is a hardware limitation on the
>>> QDR Infiniband cards that limits the available hardware contexts to 16 per
>>> card. We have to ensure that we don't over-subscribe these hardware
>>> contexts because parallel jobs without available contexts will crash. The
>>> difficulty is that the contexts needed for a job are a function of the
>>> number of compute nodes the job uses, not the number of job slots.
>>
>> When I get you right, you are seeking for something like a complex with
>> "consumable HOST" (instead of JOB or YES, i.e. consume it one time on each
>> used exechost independent from the total number of slots granted on this
>> machine). Unfortunately it was discussed before but not implemented yet.
>>
>>
> I don't think per host consumables would be needed. With a later
> version of grid engine 2 queues should be sufficient.
> 1 queue with an exclusive resource and multi-node PEs and one without
> either of those. You'd have to add a slots resource at the host level
> to stop the host being overloaded and possibly use a JSV to ensure all
> jobs are appropriately directed.
>
> Unfortunately I don't think 6.1 supports exclusive resources.

Yep, that would be a possible implementation. As the OP mentioned, one could use a consumable complex on 6.1.

If you add "complex_values network=16" and "load_thresholds network=15" to the queue, the queue will be pushed into alarm state automatically and you can avoid the load sensor. If you give the complex a default consumption of 1, it works out of the box (it's only subtracted if it's attached to a queue). I.e. the other queue for normal jobs doesn't have it attached, and you select the special multi-node queue via the requested PE.

And as outlined: the overall slot count per node needs to be limited at the exechost level.

-- Reuti
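
For reference, the setup described in this reply might look roughly like the fragment below. This is only a sketch and has not been tested on a 6.1 installation. The complex name "network" and the values 16 and 15 come from the text above, and slots=48 follows from the 48-core nodes. The shortcut "net", the queue name "mpi.q", the PE name "mpi_multinode" and the host name "node01" are made-up placeholders.

```
# qconf -mc   (complex definition: consumable with a default consumption of 1)
#name     shortcut  type  relop  requestable  consumable  default  urgency
network   net       INT   <=     YES          YES         1        0

# qconf -mq mpi.q   (hypothetical multi-node queue; only this queue carries the complex)
pe_list           mpi_multinode
complex_values    network=16
load_thresholds   network=15

# qconf -me node01   (hypothetical exechost; caps total slots across both queues)
complex_values    slots=48
```

The queue for normal jobs would leave "network" out of its complex_values, so nothing is subtracted for jobs running there.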
