Hi,

On 13.12.2012 at 00:27, [email protected] wrote:

> I've got a question that's very similar to Joseph Farran's query "How do I
> request the CPU type in qrsh / qsub with SGE 8.1.2?" [1], but which is a
> problem specifically with MPI jobs.
>
> I think we're running into a chipset-architecture issue (AMD vs Intel)
> in OpenMPI jobs. We're using SGE 6.2u5 and OpenMPI 1.33 with tight
> integration. All MPI jobs are launched by SGE.
>
> We've got a locally-written program that dynamically links against
> a package that's compiled with optimizations for different chipsets
> (ATLAS[2]). We've built ATLAS with multiple versions, optimized
> for each architecture in our cluster.
>
> This is fine for serial jobs--the login environment
> sets the path according to the chipset on each server
> (i.e., ATLASDIR=/opt/ATLAS/3.8.3/Intel/Xeon/Westmere or
> ATLASDIR=/opt/ATLAS/3.8.3/AMD/Opteron). We do the same thing for other
> packages that provide chipset optimization (LAPACK[3] and BLAS[4]).
>
> Our executables are dynamically linked, so there's no problem running
> the same program on either the Intel or AMD machines. Users simply submit
> the job to SGE and the executable uses the correct library for the server
> at runtime.
>
> Everything is fine if the job (MPI master process and slaves) all run
> on nodes of the same chip architecture.
>
> However, there seems to be a problem with OpenMPI jobs if the slave
> process runs on a different chipset than the master. I believe
> that the slave jobs are launched without going through a shell, so
> they don't get the environment settings that would be applied in an
> interactive session or SGE job. The slave process seems to run with
> the same paths as the parent. For example, if the master MPI job
> is launched on an Intel node, LD_LIBRARY_PATH may be set to include
> "/opt/ATLAS/3.8.3/Intel/Xeon/Westmere/lib", and this seems to be passed
> to slave MPI processes running on AMD nodes, with the result that they
> pick up the wrong library and this causes a segmentation fault.
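
If the inherited LD_LIBRARY_PATH really is the cause, one possible workaround is to start every MPI rank through a small wrapper that picks the ATLAS build for the node it actually lands on. The following is only a minimal sketch, not something tested on this cluster. It assumes Linux nodes, that the `vendor_id` field of /proc/cpuinfo is enough to tell the two builds apart, and that the two directories from the post above are the only ones needed. The script name `atlas-wrap.sh` and the function names are made up for illustration.

```sh
#!/bin/sh
# atlas-wrap.sh (hypothetical name): choose the ATLAS build for the local CPU,
# then run the real program. Meant to be started once per MPI rank.

# Map a /proc/cpuinfo vendor_id string to an ATLAS directory.
# The directories are the two examples from the original post.
atlas_dir_for_vendor() {
    case "$1" in
        GenuineIntel) echo /opt/ATLAS/3.8.3/Intel/Xeon/Westmere ;;
        AuthenticAMD) echo /opt/ATLAS/3.8.3/AMD/Opteron ;;
        *) return 1 ;;
    esac
}

# Print the vendor_id of the first CPU listed on this node (Linux only).
local_vendor() {
    sed -n 's/^vendor_id[[:space:]]*:[[:space:]]*//p' /proc/cpuinfo 2>/dev/null | head -n 1
}

# Only touch the environment when the vendor is recognised.
if ATLASDIR=$(atlas_dir_for_vendor "$(local_vendor)"); then
    export ATLASDIR
    # Put the local build first so it wins over a path inherited from the master.
    LD_LIBRARY_PATH="$ATLASDIR/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
fi

# Replace the wrapper with the real program, if one was given.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

Inside the job script it would be used as `mpirun -np $NSLOTS ./atlas-wrap.sh ./my_program`, where `my_program` stands for the real executable. Distinguishing between different Intel or AMD models would need a finer check than the vendor string.
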
>
> I could set up separate MPI queues within SGE per-chipset (i.e., submit jobs
> with "-pe mpi-intel" or "-pe mpi-amd"), but that adds a complication for users
> and reduces the effectiveness of SGE doing the scheduling.

Are you sure you really want to compute on different types of CPUs? I have
already noticed small variations in the output between different Xeon CPUs
(I didn't investigate further whether this comes from the CPU itself or from
the same library taking a different execution path), so having a different
combination of machines in each run makes it hard to reproduce your results
exactly. For one and the same computation I would stay on one type of CPU.

As Dave already pointed out, this is like limiting a job to one fabric:

http://www.gridengine.info/2006/02/14/grouping-jobs-to-nodes-via-wildcard-pes/

(nowadays you can even stay with one queue and attach different PEs to
different hostgroups in a single definition).

-- Reuti

> I'm wondering if there's a way to force SGE to select slave nodes from the
> same architecture type as the master MPI process, at run-time. We've already
> got the architecture as an attribute within SGE. In other words, when SGE
> determines which nodes have resources available to make up the "machine list"
> passed to OpenMPI, could that list be restricted to nodes of the same
> architecture as the node that SGE selects for the master process?
>
> Thanks,
>
> Mark
>
> [1] http://gridengine.org/pipermail/users/2012-December/005329.html
> [2] http://math-atlas.sourceforge.net/
> [3] http://www.netlib.org/lapack/
> [4] http://www.netlib.org/blas/
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
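
For reference, the wildcard-PE approach from the link above could look roughly like the sketch below. It reuses the PE names "mpi-intel" and "mpi-amd" from the original post. The hostgroup names `@intel` and `@amd`, the queue name `all.q`, the slot count and the job script name are assumptions for illustration only.

```sh
# Queue definition excerpt (edit with: qconf -mq all.q).
# One queue, with a different PE attached to each hostgroup:
#
#   pe_list   NONE,[@intel=mpi-intel],[@amd=mpi-amd]

# Request 16 slots from whichever single PE matches the pattern.
# All slots of one job then come from one PE, i.e. from one CPU type.
qsub -pe "mpi-*" 16 mpi_job.sh
```

Users keep one submit command, and the scheduler is still free to choose whichever of the two node groups has room.
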
