You are using a very old version of MVAPICH2. Your original message indicated
that you are using MVAPICH2 2.0a, but that is not correct. There have been many
enhancements in recent releases. Please use the latest release (2.0b) following
the guidelines suggested in this thread and let us know if you encounter any
issues.

DK
________________________________
From: Arjun J Rao [[email protected]]
Sent: Monday, November 25, 2013 11:58 PM
To: slurm-dev
Subject: [slurm-dev] Re: MPI_Init failed with MVAPICH2

While installing MVAPICH2, I did use ./configure --with-pm=no --with-pmi=slurm
The output of mpiname -a is

MVAPICH2 1.7 Thu Oct 13 17:31:44 EDT 2011 ch3:mrail

Compilation
CC: gcc    -DNDEBUG -DNVALGRIND -O2
CXX: g++   -DNDEBUG -DNVALGRIND -O2
F77: gfortran   -O2 -L/usr/lib64
FC: gfortran   -O2

Configuration
--prefix=/usr/mpi/gcc/mvapich2-1.7 --with-rdma=gen2 
--with-ib-include=/usr/include --with-ib-libpath=/usr/lib64 --enable-shared 
CC=gcc CXX=g++ F77=gfortran FC=gfortran



On Mon, Nov 25, 2013 at 9:51 PM, Moe Jette 
<[email protected]<mailto:[email protected]>> wrote:

Each MPI implementation is a bit different. Check your MpiDefault
configuration parameter; see:
http://slurm.schedmd.com/slurm.conf.html

Quoting Jonathan Perkins 
<[email protected]<mailto:[email protected]>>:

Hi there.  Can you share the output of mpiname -a?  In order to use srun
with mvapich2 you will need to configure mvapich2 with the following
options:

./configure --with-pm=no --with-pmi=slurm
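
For context, a full rebuild along those lines might look like the sketch below. The tarball name, install prefix, and job count are assumptions for illustration, not taken from this thread; only the --with-pm=no --with-pmi=slurm options are from Jonathan's instructions.

```shell
# Hypothetical build sketch; paths and version are assumptions.
tar xzf mvapich2-2.0b.tgz
cd mvapich2-2.0b
./configure --with-pm=no --with-pmi=slurm \
    --prefix=/opt/mvapich2-2.0b   # install prefix is an assumption
make -j4
make install
```

After installing, verify with mpiname -a that the Configuration line actually shows --with-pm=no --with-pmi=slurm; the output quoted below shows an install that was configured differently.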


On Mon, Nov 25, 2013 at 8:46 AM, Arjun J Rao 
<[email protected]<mailto:[email protected]>>wrote:

I have a cluster with two nodes, qdr3 and qdr4. I run slurmctld on qdr3
and slurmd on both qdr3 and qdr4. I have attached the slurm.conf file. I am
using MVAPICH2 2.0a (the latest is 2.0b)

I then wrote a simple MPI hello world program that mentions the process
rank and the processor name from whichever node it is run.
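
The exact program is not shown in the thread; a minimal sketch of that kind of hello world, printing the rank and processor name as described, might look like this:

```c
/* Hypothetical reconstruction of the hello-world described above;
 * not the poster's actual code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                       /* the call that fails below */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);         /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);         /* total number of ranks */
    MPI_Get_processor_name(name, &len);           /* node the rank runs on */

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
```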

I compiled the code using
mpicc -L/usr/local/lib/slurm -lpmi Hello.c

where /usr/local/lib/slurm is where the Slurm libraries reside.
Compilation and the subsequent commands were all entered in qdr3's
terminal, where slurmctld also runs.


$: salloc -N2 bash
salloc : Granted job allocation 24
$: sbcast a.out /tmp/random.a.out
$: srun /tmp/random.a.out
In: PMI_Abort(1,Fatal error in MPI_Init: Other MPI error)
In: PMI_Abort(1,Fatal error in MPI_Init: Other MPI error)

slurmd[qdr4]: *** STEP 24.0 KILLED AT 2013-11-25T18:52:52 with SIGNAL 9 ***
srun: Job step aborted: Waiting upto 2 seconds for job step to finish
srun: error: qdr3: task 0: Exited with exit code 1
srun: error: qdr4: task 1: Exited with exit code 1


I checked the /tmp folder on qdr4 and qdr3 and they did contain
random.a.out as a file. I can log in to each machine from the other without
having to use a password.

Other invocations, such as
                  srun -n4 /tmp/random.a.out
                  srun -n2 /tmp/random.a.out
                  srun -n14 /tmp/random.a.out
also fail with similar errors.


What could be going wrong here?




--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo



