Hi Chris,

> On Sep 21, 2015, at 7:36 PM, Christopher Samuel <[email protected]> wrote:
> 
> 
> On 22/09/15 07:17, Timothy Brown wrote:
> 
>> This is using mpiexec.hydra with slurm as the bootstrap. 
> 
> Have you tried Intel MPI's native PMI start up mode?
> 
> You just need to set the environment variable I_MPI_PMI_LIBRARY to the
> path to the Slurm libpmi.so file and then you should be able to use srun
> to launch your job instead.
> 

Yep, to the same effect. Here's what it gives:

srun --mpi=pmi2 \
/lustre/janus_scratch/tibr1099/osu_impi/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall
srun: error: Task launch for 973564.0 failed on node node0453: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
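
The environment setup was essentially what you describe; the exact libpmi.so
path below assumes it lives under the same prefix we pass to --with-pmi, so
adjust for your site:

export I_MPI_PMI_LIBRARY=/curc/slurm/slurm/current/lib/libpmi.so
srun --mpi=pmi2 ./osu_alltoall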



> More here:
> 
> http://slurm.schedmd.com/mpi_guide.html#intel_srun
> 
>> If I switch to OpenMPI the error is:
> 
> Which version, and was it built with --with-slurm and (if your
> version is not too ancient) --with-pmi=/path/to/slurm/install?

Yep. 1.8.5 (for 1.10 we're going to try to move everything to EasyBuild). Yes,
we included PMI and the Slurm option. Our configure invocation was:

module purge
module load slurm/slurm
module load gcc/5.1.0
./configure  \
  --prefix=/curc/tools/x86_64/rh6/software/openmpi/1.8.5/gcc/5.1.0 \
  --with-threads=posix \
  --enable-mpi-thread-multiple \
  --with-slurm \
  --with-pmi=/curc/slurm/slurm/current/ \
  --enable-static \
  --enable-wrapper-rpath \
  --enable-sensors \
  --enable-mpi-ext=all \
  --with-verbs
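
As a sanity check that the build actually picked up PMI, listing the
compiled-in components shows the pmi ones (ess, db and pubsub on this
version, if I remember right):

ompi_info | grep -i pmi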

It's got me scratching my head: I started off thinking it was an MPI issue,
and spent a while getting Intel's Hydra and OpenMPI's OOB to go over IB
instead of GigE. That increased the success rate, but we were still seeing
failures.

I tried a pure PMI (version 1) program (init, rank, size, finalize), which
worked most of the time; that made me think it was MPI again! However, it
fails often enough to say it's not MPI. The PMI v2 code I wrote gives the
wrong results for rank and world size, so I'm sweeping that under the rug
until I understand it!
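
For reference, the PMI v1 test is essentially just the following (a minimal
sketch; it assumes Slurm's pmi.h and links against its libpmi):

#include <stdio.h>
#include <slurm/pmi.h>          /* Slurm's PMI v1 header */

int main(void)
{
    int spawned, rank, size;

    /* every PMI call returns PMI_SUCCESS on success */
    if (PMI_Init(&spawned) != PMI_SUCCESS) {
        fprintf(stderr, "PMI_Init failed\n");
        return 1;
    }
    PMI_Get_rank(&rank);
    PMI_Get_size(&size);
    printf("rank %d of %d\n", rank, size);
    PMI_Finalize();
    return 0;
}

Built with something like:

gcc pmi_test.c -I/curc/slurm/slurm/current/include \
    -L/curc/slurm/slurm/current/lib -lpmi -o pmi_test

and launched with plain srun.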

Just wondering if anybody has seen anything like this. I'm happy to share our
Slurm conf file if that helps.

The only other thing I could possibly point a finger at (though I don't
believe it's the cause) is that the Slurm masters (slurmctld) are only on
GigE.
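
If the masters being on GigE mattered, I'd expect it to show up as the
message timeouts biting (which would fit the "Socket timed out" errors, if
that's what is happening here). The current settings can be read with:

scontrol show config | grep -i timeout

with MessageTimeout presumably being the interesting one.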

I'm half thinking of opening a trouble ticket, but was hoping to get more
information first (and possibly avoid increasing Slurm's logging, which is my
only other idea).

Thanks for your thoughts Chris.

Timothy
