<[email protected]> writes:

> I think we're running into a chipset-architecture issue (AMD vs Intel)
> in OpenMPI jobs. We're using SGE 6.2u5 and OpenMPI 1.3.3 with tight
> integration. All MPI jobs are launched by SGE.

At least if you're using recent AMD processors, you'll want to upgrade
OpenMPI (and SGE, if you don't use exclusive node access) to get core
binding right.  SGE 8.0.0c or later handles binding correctly on
partially-full nodes.
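
For example, with a newer OpenMPI you can ask mpirun for explicit
binding (a sketch only; the option spelling below is from the 1.4/1.5
series, so check your mpirun(1)):

    # bind each rank to a core and print the resulting bindings
    mpirun --bind-to-core --report-bindings -np 16 ./a.out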

> We've got a locally-written program that dynamically links against
> a package that's compiled with optimizations for different chipsets
> (ATLAS[2]). We've built multiple versions of ATLAS, optimized
> for each architecture in our cluster.
>
> This is fine for serial jobs: the login environment
> sets the path according to the chipset on each server
> (i.e., ATLASDIR=/opt/ATLAS/3.8.3/Intel/Xeon/Westmere or
> ATLASDIR=/opt/ATLAS/3.8.3/AMD/Opteron). We do the same thing for other
> packages that provide chipset-specific optimizations (LAPACK[3] and BLAS[4]).

[If you don't mind proprietary libraries, ACML probably does better on
AMD.  Why use the netlib BLAS at all when you have a tuned one
available?]

> Our executables are dynamically linked, so there's no problem running
> the same program on either the Intel or AMD machines. Users simply submit
> the job to SGE and the executable uses the correct library for the server
> at runtime.
>
> Everything is fine if all of the job's processes (MPI master and
> slaves) run on nodes of the same chip architecture.
>
> However, there seems to be a problem with OpenMPI jobs if the slave
> process runs on a different chipset than the master. I believe
> that the slave jobs are launched without going through a shell, so
> they don't get the environment settings that would be applied in an
> interactive session or SGE job. The slave process seems to run with
> the same paths as the parent. For example, if the master MPI job
> is launched on an Intel node, LD_LIBRARY_PATH may be set to include
> "/opt/ATLAS/3.8.3/Intel/Xeon/Westmere/lib", and this seems to be passed
> to slave MPI processes running on AMD nodes, with the result that they
> pick up the wrong library, causing a segmentation fault.

Provide the architecture-dependent directories under
architecture-independent names: e.g. the
/opt/ATLAS/3.8.3/Intel/Xeon/Westmere/lib contents get mounted or linked
as /usr/local/lib/atlas in your node image for Westmeres.  Then the
same LD_LIBRARY_PATH is valid everywhere, each node resolves it to its
own tuned build, and it no longer matters what the master passes to the
slaves.  A sketch follows.
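
Something like this in the provisioning step for each node (a sketch
only; the target path /usr/local/lib/atlas is an arbitrary choice, and
the vendor_id test is the crudest split that works for a two-vendor
cluster):

    #!/bin/sh
    # Expose this node's tuned ATLAS build under a single
    # architecture-independent path.
    if grep -q GenuineIntel /proc/cpuinfo; then
        src=/opt/ATLAS/3.8.3/Intel/Xeon/Westmere/lib
    else
        src=/opt/ATLAS/3.8.3/AMD/Opteron/lib
    fi
    ln -sfn "$src" /usr/local/lib/atlas

Then everyone puts /usr/local/lib/atlas on LD_LIBRARY_PATH once, and
the value the master propagates is correct on every node.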

> I could set up separate MPI queues within SGE per chipset (i.e., submit jobs
> with "-pe mpi-intel" or "-pe mpi-amd"), but that adds a complication for users
> and reduces the effectiveness of SGE's scheduling.

That's the right thing to do under most circumstances.  I'm told by
experts that MPI programs normally don't work well across different
types of processor: the ranks run at different speeds, the faster ones
wait on the slower ones, and efficiency drops.  It needn't add any
complication for the users, either.  Our PEs aren't named for
architectures, but they do distinguish them (e.g. by core count) as
well as by fabric, and a request for "-pe openmpi" is rewritten to the
wildcard "openmpi-*" as usual.  See the sketch below.
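
I.e., something like this (the PE names are hypothetical; ours differ),
with one PE per architecture, each attached only to the matching hosts:

    # what users type:
    qsub -pe openmpi 16 job.sh
    # what it becomes (via a JSV, or users can request the
    # wildcard directly):
    qsub -pe 'openmpi-*' 16 job.sh
    # SGE picks exactly one matching PE for the whole job,
    # e.g. openmpi-intel (Intel hosts only) or openmpi-amd
    # (AMD hosts only), so a job never spans chipsets.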

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

Reply via email to