While installing MVAPICH2, I used ./configure --with-pm=no
--with-pmi=slurm
The output of mpiname -a is

MVAPICH2 1.7 Thu Oct 13 17:31:44 EDT 2011 ch3:mrail

Compilation
CC: gcc    -DNDEBUG -DNVALGRIND -O2
CXX: g++   -DNDEBUG -DNVALGRIND -O2
F77: gfortran   -O2 -L/usr/lib64
FC: gfortran   -O2

Configuration
--prefix=/usr/mpi/gcc/mvapich2-1.7 --with-rdma=gen2
--with-ib-include=/usr/include --with-ib-libpath=/usr/lib64 --enable-shared
CC=gcc CXX=g++ F77=gfortran FC=gfortran
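
For reference, the kind of hello world program described further down in this thread could look like the sketch below (the original Hello.c was not posted, so this is only an assumption about its contents): it prints each process's rank and the name of the node it runs on.

```c
/* Minimal MPI "hello world" sketch: report rank, world size, and
 * the processor (node) name for each task. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &name_len);

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
```

With an MVAPICH2 build configured as above (--with-pm=no --with-pmi=slurm), this should compile with a plain "mpicc Hello.c -o a.out" and launch directly under srun, without linking against Slurm's PMI library by hand.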



On Mon, Nov 25, 2013 at 9:51 PM, Moe Jette <[email protected]> wrote:

>
> Each MPI implementation is a bit different. Check your MpiDefault
> configuration parameter; see:
> http://slurm.schedmd.com/slurm.conf.html
>
> Quoting Jonathan Perkins <[email protected]>:
>
>> Hi there.  Can you share the output of mpiname -a?  In order to use srun
>> with mvapich2 you will need to configure mvapich2 with the following
>> options:
>>
>> ./configure --with-pm=no --with-pmi=slurm
>>
>>
>> On Mon, Nov 25, 2013 at 8:46 AM, Arjun J Rao <[email protected]>
>> wrote:
>>
>>> I have a cluster with two nodes qdr3 and qdr4. I run slurmctld on qdr3
>>> and slurmd on both qdr3 and qdr4. I have attached the slurm.conf file.
>>> I am using MVAPICH2 2.0a (the latest is 2.0b).
>>>
>>> I then wrote a simple MPI hello world program that prints the process
>>> rank and the processor name of whichever node it runs on.
>>>
>>> I compiled the code using
>>> mpicc -L/usr/local/lib/slurm -lpmi Hello.c
>>>
>>> where /usr/local/lib/slurm is the place where slurm libraries reside.
>>> Compilation and the subsequent commands were all entered in qdr3's
>>> terminal, where slurmctld runs too.
>>>
>>>
>>> $: salloc -N2 bash
>>> salloc : Granted job allocation 24
>>> $: sbcast a.out /tmp/random.a.out
>>> $: srun /tmp/random.a.out
>>> In: PMI_Abort(1,Fatal error in MPI_Init: Other MPI error)
>>> In: PMI_Abort(1,Fatal error in MPI_Init: Other MPI error)
>>>
>>> slurmd[qdr4]: *** STEP 24.0 KILLED AT 2013-11-25T18:52:52 with SIGNAL 9
>>> ***
>>> srun: Job step aborted: Waiting upto 2 seconds for job step to finish
>>> srun: error: qdr3: task 0: Exited with exit code 1
>>> srun: error: qdr4: task 1: Exited with exit code 1
>>>
>>>
>>> I checked the /tmp folder on both qdr4 and qdr3, and each contained
>>> random.a.out. I can log in to each machine from the other without
>>> having to use a password.
>>>
>>> Other invocations, such as
>>>   srun -n4 /tmp/random.a.out
>>>   srun -n2 /tmp/random.a.out
>>>   srun -n14 /tmp/random.a.out
>>> also fail with similar errors.
>>>
>>>
>>> What could be going wrong here?
>>>
>>>
>>
>>
>> --
>> Jonathan Perkins
>> http://www.cse.ohio-state.edu/~perkinjo
>>
>>
>
