Hi there.  Can you share the output of mpiname -a?  To launch jobs with
srun, you will need to configure MVAPICH2 with the following options:

./configure --with-pm=no --with-pmi=slurm
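For example, a rebuild along these lines (the source directory and install
prefix are placeholders, not paths known from this thread):

```shell
# Rebuild MVAPICH2 so it uses Slurm's PMI library instead of its own
# process manager (mpirun_rsh/hydra); srun can then launch the tasks.
cd mvapich2-2.0a                     # your MVAPICH2 source tree
./configure --with-pm=no --with-pmi=slurm \
            --prefix=/opt/mvapich2-slurm   # example prefix, adjust
make -j4
make install
# Recompile your MPI applications against this build before running srun.
```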


On Mon, Nov 25, 2013 at 8:46 AM, Arjun J Rao <[email protected]> wrote:

>  I have a cluster with two nodes, qdr3 and qdr4. I run slurmctld on qdr3
> and slurmd on both qdr3 and qdr4. I have attached the slurm.conf file. I am
> using MVAPICH2 2.0a (the latest is 2.0b).
>
>  I then wrote a simple MPI hello-world program that prints the process
> rank and the processor name of whichever node it runs on.
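A minimal hello-world of the kind described here might look like the
following sketch (the file name Hello.c comes from the compile command
below; the exact program text is an assumption):

```c
/* Sketch of an MPI hello-world printing each task's rank and host. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                     /* this is the call that fails below */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}
```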
>
> I compiled the code using
> mpicc -L/usr/local/lib/slurm -lpmi Hello.c
>
> where /usr/local/lib/slurm is the place where slurm libraries reside.
> Compilation and the subsequent commands were all entered in qdr3's
> terminal, where slurmctld runs too.
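As an aside, linker options are conventionally placed after the source
file so the linker can resolve the PMI symbols; a hedged variant of the
compile line, using the same library path quoted above, would be:

```shell
# Source file first, then -L/-l, so ld resolves PMI symbols from libpmi.
mpicc Hello.c -L/usr/local/lib/slurm -lpmi -o a.out
```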
>
>
> $: salloc -N2 bash
> salloc : Granted job allocation 24
> $: sbcast a.out /tmp/random.a.out
> $: srun /tmp/random.a.out
> In: PMI_Abort(1,Fatal error in MPI_Init: Other MPI error)
> In: PMI_Abort(1,Fatal error in MPI_Init: Other MPI error)
>
> slurmd[qdr4]: *** STEP 24.0 KILLED AT 2013-11-25T18:52:52 with SIGNAL 9 ***
> srun: Job step aborted: Waiting upto 2 seconds for job step to finish
> srun: error: qdr3: task 0: Exited with exit code 1
> srun: error: qdr4: task 1: Exited with exit code 1
>
>
> I checked the /tmp folder on qdr4 and qdr3 and they did contain
> random.a.out as a file. I can log in to each machine from the other without
> having to use a password.
>
> Other invocations such as
>                   srun -n4 /tmp/random.a.out
>                   srun -n2 /tmp/random.a.out
>                   srun -n14 /tmp/random.a.out
> also fail with similar errors.
>
>
> What could be going wrong here?
>



-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo
