While installing MVAPICH2, I did use

    ./configure --with-pm=no --with-pmi=slurm

The output of mpiname -a is:
MVAPICH2 1.7 Thu Oct 13 17:31:44 EDT 2011 ch3:mrail
Compilation
CC: gcc -DNDEBUG -DNVALGRIND -O2
CXX: g++ -DNDEBUG -DNVALGRIND -O2
F77: gfortran -O2 -L/usr/lib64
FC: gfortran -O2
Configuration
--prefix=/usr/mpi/gcc/mvapich2-1.7 --with-rdma=gen2 --with-ib-include=/usr/include --with-ib-libpath=/usr/lib64 --enable-shared CC=gcc CXX=g++ F77=gfortran FC=gfortran

On Mon, Nov 25, 2013 at 9:51 PM, Moe Jette <[email protected]> wrote:
>
> Each MPI implementation is a bit different. Check your MpiDefault
> configuration parameter; see:
> http://slurm.schedmd.com/slurm.conf.html
>
> Quoting Jonathan Perkins <[email protected]>:
>
>> Hi there. Can you share the output of mpiname -a? In order to use srun
>> with mvapich2 you will need to configure mvapich2 with the following
>> options:
>>
>> ./configure --with-pm=no --with-pmi=slurm
>>
>> On Mon, Nov 25, 2013 at 8:46 AM, Arjun J Rao <[email protected]> wrote:
>>
>>> I have a cluster with two nodes, qdr3 and qdr4. I run slurmctld on qdr3
>>> and slurmd on both qdr3 and qdr4. I have attached the slurm.conf file.
>>> I am using MVAPICH2 2.0a (the latest is 2.0b).
>>>
>>> I then wrote a simple MPI hello world program that prints the process
>>> rank and the processor name from whichever node it is run on.
>>>
>>> I compiled the code using
>>>
>>> mpicc -L/usr/local/lib/slurm -lpmi Hello.c
>>>
>>> where /usr/local/lib/slurm is where the Slurm libraries reside.
>>> Compilation and the subsequent commands were all entered in qdr3's
>>> terminal, where slurmctld runs too.
>>>
>>> $: salloc -N2 bash
>>> salloc: Granted job allocation 24
>>> $: sbcast a.out /tmp/random.a.out
>>> $: srun /tmp/random.a.out
>>> In: PMI_Abort(1,Fatal error in MPI_Init: Other MPI error)
>>> In: PMI_Abort(1,Fatal error in MPI_Init: Other MPI error)
>>>
>>> slurmd[qdr4]: *** STEP 24.0 KILLED AT 2013-11-25T18:52:52 WITH SIGNAL 9 ***
>>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish
>>> srun: error: qdr3: task 0: Exited with exit code 1
>>> srun: error: qdr4: task 1: Exited with exit code 1
>>>
>>> I checked the /tmp folder on qdr3 and qdr4, and both contained
>>> random.a.out as a file. I can log in to each machine from the other
>>> without having to use a password.
>>>
>>> Even
>>> srun -n4 /tmp/random.a.out
>>> srun -n2 /tmp/random.a.out
>>> srun -n14 /tmp/random.a.out
>>> all fail with similar errors.
>>>
>>> What could be going wrong here?
>>
>> --
>> Jonathan Perkins
>> http://www.cse.ohio-state.edu/~perkinjo
