We have a heterogeneous grid with two types of execution hosts: some newer nodes with 32 cores and 2 GB of memory per core, and some older nodes with two cores and 2 to 4 GB of memory per core.
We have an engineer who submits parallel MPICH2 jobs to the grid. While we have not yet figured out exactly what is happening, the following is empirically observable:

If the 'master' node for the MPICH2 job is assigned to one of the newer nodes, everything works fine and the job runs, even if some of the older nodes are used as part of the computation.

If the 'master' node for the MPICH2 job is assigned to one of the older nodes, the job dies with an error saying that 'too many files are open'.

I am guessing that this is a resource issue, possibly due to the lower total amount of memory available on the older nodes.

So the question is: how can I force the master node to always be one of the newer nodes? It is fine if the job uses a mix of old and new nodes -- in fact, we *want* it to -- but the 'master' node should always be one of the newer ones. I have set up two hostgroups corresponding to the old and new nodes, but if I specify the new hostgroup as a requirement, the job runs only on the new nodes rather than on all of them.

What sort of resource requirement can I set up that will apply only to the master node and not to all of the execution hosts?

JY
