You are using a very old version of MVAPICH2. Your original message indicated MVAPICH2 2.0a, but the mpiname output below shows 1.7, so that information is not correct. There have been many enhancements in recent releases. Please use the latest release (2.0b), configured according to the guidelines suggested in this thread, and let us know if you encounter any issues.
DK

________________________________
From: Arjun J Rao [[email protected]]
Sent: Monday, November 25, 2013 11:58 PM
To: slurm-dev
Subject: [slurm-dev] Re: MPI_Init failed with MVAPICH2

While installing MVAPICH2, I did use

./configure --with-pm=no --with-pmi=slurm

The output of mpiname -a is:

MVAPICH2 1.7 Thu Oct 13 17:31:44 EDT 2011 ch3:mrail

Compilation
CC: gcc -DNDEBUG -DNVALGRIND -O2
CXX: g++ -DNDEBUG -DNVALGRIND -O2
F77: gfortran -O2 -L/usr/lib64
FC: gfortran -O2

Configuration
--prefix=/usr/mpi/gcc/mvapich2-1.7 --with-rdma=gen2 --with-ib-include=/usr/include --with-ib-libpath=/usr/lib64 --enable-shared CC=gcc CXX=g++ F77=gfortran FC=gfortran

On Mon, Nov 25, 2013 at 9:51 PM, Moe Jette <[email protected]> wrote:

> Each MPI implementation is a bit different. Check your MpiDefault
> configuration parameter; see:
> http://slurm.schedmd.com/slurm.conf.html
>
> Quoting Jonathan Perkins <[email protected]>:
>
>> Hi there. Can you share the output of mpiname -a? In order to use srun
>> with mvapich2 you will need to configure mvapich2 with the following
>> options:
>>
>> ./configure --with-pm=no --with-pmi=slurm
>>
>> On Mon, Nov 25, 2013 at 8:46 AM, Arjun J Rao <[email protected]> wrote:
>>
>>> I have a cluster with two nodes, qdr3 and qdr4. I run slurmctld on qdr3
>>> and slurmd on both qdr3 and qdr4. I have attached the slurm.conf file.
>>> I am using MVAPICH2 2.0a (the latest is 2.0b).
>>>
>>> I then wrote a simple MPI hello world program that prints the process
>>> rank and the processor name of whichever node it runs on. I compiled
>>> the code using
>>>
>>> mpicc -L/usr/local/lib/slurm -lpmi Hello.c
>>>
>>> where /usr/local/lib/slurm is where the Slurm libraries reside.
>>> Compilation and the subsequent commands were all entered in qdr3's
>>> terminal, where slurmctld runs too.
>>> $: salloc -N2 bash
>>> salloc: Granted job allocation 24
>>> $: sbcast a.out /tmp/random.a.out
>>> $: srun /tmp/random.a.out
>>> In: PMI_Abort(1,Fatal error in MPI_Init: Other MPI error)
>>> In: PMI_Abort(1,Fatal error in MPI_Init: Other MPI error)
>>> slurmd[qdr4]: *** STEP 24.0 KILLED AT 2013-11-25T18:52:52 with SIGNAL 9 ***
>>> srun: Job step aborted: Waiting upto 2 seconds for job step to finish
>>> srun: error: qdr3: task 0: Exited with exit code 1
>>> srun: error: qdr4: task 1: Exited with exit code 1
>>>
>>> I checked the /tmp folder on both qdr3 and qdr4, and each contained
>>> random.a.out. I can log in to each machine from the other without
>>> having to use a password. The variants
>>>
>>> srun -n4 /tmp/random.a.out
>>> srun -n2 /tmp/random.a.out
>>> srun -n14 /tmp/random.a.out
>>>
>>> don't work either and give similar errors. What could be going wrong
>>> here?

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo
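[Editor's note] The advice in the thread boils down to: rebuild MVAPICH2 itself with Slurm's PMI support (the `mpiname` output shows the installed 1.7 build was configured with neither `--with-pm=no` nor `--with-pmi=slurm`), then relink the application and launch it with srun. A rough sketch of that workflow, assuming a current MVAPICH2 source tree and a Slurm install under /usr/local; the prefix and paths are illustrative and will differ per site:

```shell
# From the MVAPICH2 source tree: disable the built-in process manager
# and use Slurm's PMI instead (prefix is illustrative).
./configure --with-pm=no --with-pmi=slurm --prefix=/opt/mvapich2-slurm
make && make install

# Compile with the mpicc from the NEW build, linking Slurm's PMI library
# (put -lpmi after the source file so the linker resolves it correctly).
/opt/mvapich2-slurm/bin/mpicc Hello.c -L/usr/local/lib/slurm -lpmi -o hello

# Allocate nodes, broadcast the binary, and launch the step with srun.
salloc -N2 bash
sbcast hello /tmp/hello
srun /tmp/hello
```

Note that the mpicc used must come from the Slurm-enabled installation; compiling against an MVAPICH2 that still uses its own process manager is exactly what produces the MPI_Init/PMI_Abort failure shown above. Moe Jette's point about `MpiDefault` in slurm.conf applies at launch time: it must select a PMI type the MPI library was built for.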
