Each MPI implementation is a bit different. Check your MpiDefault configuration parameter; see:
http://slurm.schedmd.com/slurm.conf.html

Quoting Jonathan Perkins <[email protected]>:

Hi there.  Can you share the output of mpiname -a?  In order to use srun
with mvapich2 you will need to configure mvapich2 with the following
options:

./configure --with-pm=no --with-pmi=slurm
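For reference, a full rebuild after reconfiguring might look roughly like this (the install prefix and job count are placeholders; adjust for your setup):

```shell
# Reconfigure MVAPICH2 to use Slurm's PMI library for process startup
# (/opt/mvapich2-slurm is an illustrative install prefix, not from the thread)
./configure --with-pm=no --with-pmi=slurm --prefix=/opt/mvapich2-slurm
make -j4
make install

# Verify the result: mpiname -a reports the version and configure options,
# which is what Jonathan is asking to see
/opt/mvapich2-slurm/bin/mpiname -a
```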


On Mon, Nov 25, 2013 at 8:46 AM, Arjun J Rao <[email protected]> wrote:

I have a cluster with two nodes, qdr3 and qdr4. I run slurmctld on qdr3
and slurmd on both qdr3 and qdr4. I have attached the slurm.conf file. I am
using MVAPICH2 2.0a (the latest is 2.0b).

I then wrote a simple MPI hello world program that prints the process
rank and the processor name on whichever node it runs.
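The attached program is not shown in the thread, but a minimal MPI hello world of the kind described would look roughly like this (the file name Hello.c is taken from the compile command below; everything else is an illustrative sketch):

```c
/* Hello.c - minimal MPI program printing each task's rank and host name.
   A sketch of the kind of test program described above, not the actual file. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, namelen;
    char procname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                  /* the PMI_Abort in the error output fires here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this task's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total tasks in the job step */
    MPI_Get_processor_name(procname, &namelen);

    printf("Hello from rank %d of %d on %s\n", rank, size, procname);

    MPI_Finalize();
    return 0;
}
```

Launched with srun across the allocation, each task should print one line naming its rank and node.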

I compiled the code using
mpicc -L/usr/local/lib/slurm -lpmi Hello.c

where /usr/local/lib/slurm is the place where slurm libraries reside.
Compilation and the subsequent commands were all entered in qdr3's
terminal, where slurmctld runs too.


$: salloc -N2 bash
salloc: Granted job allocation 24
$: sbcast a.out /tmp/random.a.out
$: srun /tmp/random.a.out
In: PMI_Abort(1,Fatal error in MPI_Init: Other MPI error)
In: PMI_Abort(1,Fatal error in MPI_Init: Other MPI error)

slurmd[qdr4]: *** STEP 24.0 KILLED AT 2013-11-25T18:52:52 with SIGNAL 9 ***
srun: Job step aborted: Waiting upto 2 seconds for job step to finish
srun: error: qdr3: task 0: Exited with exit code 1
srun: error: qdr4: task 1: Exited with exit code 1


I checked the /tmp folder on both qdr3 and qdr4, and each contained
random.a.out. I can log in to each machine from the other without
having to use a password.

Variants such as
                  srun -n4 /tmp/random.a.out
                  srun -n2 /tmp/random.a.out
                  srun -n14 /tmp/random.a.out
don't work either, and fail with similar errors.


What could be going wrong here?




--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo