This sounds like something in Slurm - I don’t know how srun would know to emit 
a message if the app was failing to open a socket between its own procs.

Try starting the OMPI job with “mpirun” instead of srun and see if it has the 
same issue. If not, then that’s pretty convincing that it’s Slurm.
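
For example, something along these lines from inside the same allocation (the 
binary path is just a placeholder for an OMPI-built copy of the benchmark):

  salloc -N2 -n24 mpirun ./osu_alltoall    # OMPI’s own launcher, no srun in the path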


> On Sep 21, 2015, at 7:26 PM, Timothy Brown <[email protected]> 
> wrote:
> 
> 
> Hi Chris,
> 
> 
>> On Sep 21, 2015, at 7:36 PM, Christopher Samuel <[email protected]> 
>> wrote:
>> 
>> 
>> On 22/09/15 07:17, Timothy Brown wrote:
>> 
>>> This is using mpiexec.hydra with slurm as the bootstrap. 
>> 
>> Have you tried Intel MPI's native PMI startup mode?
>> 
>> You just need to set the environment variable I_MPI_PMI_LIBRARY to the
>> path to the Slurm libpmi.so file and then you should be able to use srun
>> to launch your job instead.
>> 
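>> For example (the libpmi.so path below is just a placeholder for wherever 
>> your Slurm install puts it):
>> 
>>  export I_MPI_PMI_LIBRARY=/path/to/slurm/lib/libpmi.so
>>  srun -n 24 ./osu_alltoall
>> 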
> 
> Yeap, to the same effect. Here's what it gives:
> 
> srun --mpi=pmi2 
> /lustre/janus_scratch/tibr1099/osu_impi/libexec/osu-micro-benchmarks//mpi/collective/osu_alltoall
> srun: error: Task launch for 973564.0 failed on node node0453: Socket timed 
> out on send/recv operation
> srun: error: Application launch failed: Socket timed out on send/recv 
> operation
> 
> 
> 
>> More here:
>> 
>> http://slurm.schedmd.com/mpi_guide.html#intel_srun
>> 
>>> If I switch to OpenMPI the error is:
>> 
>> Which version, and was it build with --with-slurm and (if you're
>> version is not too ancient) --with-pmi=/path/to/slurm/install ?
> 
> Yeap. 1.8.5 (for 1.10 we're going to try to move everything to EasyBuild). 
> Yes, we included PMI and the Slurm option. Our configure statement was:
> 
> module purge
> module load slurm/slurm
> module load gcc/5.1.0
> ./configure  \
>  --prefix=/curc/tools/x86_64/rh6/software/openmpi/1.8.5/gcc/5.1.0 \
>  --with-threads=posix \
>  --enable-mpi-thread-multiple \
>  --with-slurm \
>  --with-pmi=/curc/slurm/slurm/current/ \
>  --enable-static \
>  --enable-wrapper-rpath \
>  --enable-sensors \
>  --enable-mpi-ext=all \
>  --with-verbs
> 
> It's got me scratching my head. I started off thinking it was an MPI issue 
> and spent a while getting Intel's hydra and OpenMPI's oob to go over IB 
> instead of gig-E. This increased the success rate, but we were still failing.
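> 
> For reference, the settings I mean were roughly the following (ib0 is an 
> assumption for whatever the IPoIB interface is called on the nodes):
> 
>  export I_MPI_HYDRA_IFACE=ib0                        # Intel MPI / hydra
>  mpirun --mca oob_tcp_if_include ib0 ./osu_alltoall  # OpenMPI out-of-band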
> 
> Tried out a pure PMI (version 1) code (init, rank, size, fini), which worked 
> a lot of the time - which made me think it was MPI again! However, it still 
> fails often enough to say it's not MPI. The PMI v2 code I wrote gives the 
> wrong results for rank and world size, so I'm sweeping that under the rug 
> until I understand it!
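> 
> In case it helps, this is roughly how that PMI v1 test gets built and run 
> (pmi1_test.c is just what I'm calling the test source here; the include/lib 
> paths follow the --with-pmi prefix above):
> 
>  gcc pmi1_test.c -I/curc/slurm/slurm/current/include \
>      -L/curc/slurm/slurm/current/lib -lpmi -o pmi1_test
>  srun -n 4 ./pmi1_test   # calls PMI_Init, PMI_Get_rank, PMI_Get_size, PMI_Finalize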
> 
> Just wondering if anybody has seen anything like this. Am happy to share our 
> conf file if that helps.
> 
> The only other thing I could possibly point a finger at (though I don't 
> believe it's the culprit) is that the Slurm masters (slurmctld) are only on 
> gig-E.
> 
> I'm half thinking of opening a TT, but I was hoping to get more information 
> first (and possibly avoid turning up Slurm's logging, which is my only other 
> idea).
> 
> Thanks for your thoughts Chris.
> 
> Timothy
