Am Sun, 20 Nov 2016 15:12:47 -0800
schrieb Christopher Samuel <sam...@unimelb.edu.au>: 

> If your MPI stack properly supports Slurm shouldn't that be:
> 
> sbatch --ntasks=16  --tasks-per-node=2  --wrap 'srun ./helloWorldMPI'
> ?
> Otherwise you're at the mercy of what your mpiexec chooses to do.

If the MPI stack properly supports Slurm, it will do the right thing.
If the srun call it issues to start the MPI ranks is wrong, the
stack's interaction with Slurm should be fixed.

In both cases, whether invoking srun directly or going through mpirun,
isn't proper integration needed to make things work?

Recalling my tests of MPI startup on our main cluster (CentOS 7,
16 physical cores per node):

- Intel MPI (compiler 15.0.3 and the associated MPI, roughly that vintage)
-- start via srun: just hangs
-- mpirun --bootstrap slurm: works
-- mpirun --bootstrap ssh: works
- Open MPI (1.8 or so) built --with-slurm but without PMI
-- srun: starts one MPI process on each node
-- mpirun: works
- Open MPI without anything (SSH method)
-- srun: starts one MPI process on each node
-- mpirun: 16 processes on first node only
- Open MPI --with-slurm and with PMI
-- srun: works
-- mpirun: works

So, for the Open MPI build with Slurm and PMI it did not matter whether
srun or mpirun was used. The other builds did not work properly with
srun, but all builds --with-slurm (PMI or not) worked just fine using
mpirun. Intel MPI needs some extra environment setup to make srun
work; I did not have that at hand back then (in my notes there is
something about it not even working with libpmi.so, though I am not
sure anymore how it went wrong).
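
For completeness, a rough sketch of how the srun-based launches look, as
far as I understand it; the --mpi=pmi2 selection and the libpmi.so path
are only illustrative assumptions, not values from our setup:

  # Open MPI built --with-slurm and with PMI: srun starts the ranks itself
  srun --mpi=pmi2 ./helloWorldMPI

  # Intel MPI: point srun's PMI hook at Slurm's library first
  # (path is only an example, adjust to the local installation)
  export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
  srun ./helloWorldMPI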

My point being: if things are set up to work with the MPI stack at all,
I see no value in insisting on srun. Calling mpirun seems to be the
more robust method.

Since mpirun can be made to work both with Intel MPI via --bootstrap slurm
(the default, actually) and with Open MPI simply via --with-slurm, we
settled on that method and avoided linking in any code specific to the
batch system version (like using libpmi.so from an older Slurm version,
which could cause issues in the future).
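
As a rough sketch (the resource numbers are placeholders and
helloWorldMPI stands for any MPI binary), a job script for that
approach looks like:

  #!/bin/bash
  #SBATCH --ntasks=16
  #SBATCH --ntasks-per-node=2
  # mpirun takes the node list and task count from the Slurm allocation:
  # Open MPI via its --with-slurm support, Intel MPI via the slurm bootstrap.
  mpirun ./helloWorldMPI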

Is that way of running MPI jobs in Slurm not supported?


Alrighty then,

Thomas

-- 
Dr. Thomas Orgis
Universität Hamburg
RRZ / Basisinfrastruktur / HPC
Schlüterstr. 70
20146 Hamburg
Tel.: 040/42838 8826
Fax: 040/428 38 6270
