Hi there. Can you share the output of mpiname -a? In order to use srun with mvapich2 you will need to configure mvapich2 with the following options:
    ./configure --with-pm=no --with-pmi=slurm

On Mon, Nov 25, 2013 at 8:46 AM, Arjun J Rao <[email protected]> wrote:
> I have a cluster with two nodes, qdr3 and qdr4. I run slurmctld on qdr3
> and slurmd on both qdr3 and qdr4. I have attached the slurm.conf file.
> I am using MVAPICH2 2.0a (the latest is 2.0b).
>
> I then wrote a simple MPI hello world program that prints the process
> rank and the processor name from whichever node it runs on.
>
> I compiled the code using
>
>     mpicc -L/usr/local/lib/slurm -lpmi Hello.c
>
> where /usr/local/lib/slurm is where the slurm libraries reside.
> Compilation and the subsequent commands were all entered in qdr3's
> terminal, where slurmctld runs too.
>
>     $ salloc -N2 bash
>     salloc: Granted job allocation 24
>     $ sbcast a.out /tmp/random.a.out
>     $ srun /tmp/random.a.out
>     In: PMI_Abort(1,Fatal error in MPI_Init: Other MPI error)
>     In: PMI_Abort(1,Fatal error in MPI_Init: Other MPI error)
>
>     slurmd[qdr4]: *** STEP 24.0 KILLED AT 2013-11-25T18:52:52 with SIGNAL 9 ***
>     srun: Job step aborted: Waiting upto 2 seconds for job step to finish
>     srun: error: qdr3: task 0: Exited with exit code 1
>     srun: error: qdr4: task 1: Exited with exit code 1
>
> I checked the /tmp folder on qdr3 and qdr4 and they did contain
> random.a.out as a file. I can log in to each machine from the other
> without having to use a password.
>
> Even
>
>     srun -n4 /tmp/random.a.out
>     srun -n2 /tmp/random.a.out
>     srun -n14 /tmp/random.a.out
>
> don't work and give off similar errors.
>
> What could be going wrong here?

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo
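For reference, a minimal sketch of the rebuild-and-relaunch sequence being suggested above. The source directory and install prefix here are assumptions (not taken from the thread); adjust them to your own layout:

    # Reconfigure MVAPICH2 to use SLURM's PMI instead of its own
    # process manager (assumed paths; adapt to your installation).
    cd mvapich2-2.0a
    ./configure --with-pm=no --with-pmi=slurm --prefix=/opt/mvapich2-slurm
    make && make install

    # Recompile the test program with the rebuilt mpicc; object files
    # should come before -lpmi so the linker can resolve PMI symbols.
    /opt/mvapich2-slurm/bin/mpicc Hello.c -L/usr/local/lib/slurm -lpmi -o a.out

    # Launch directly through SLURM, as in the original report:
    salloc -N2 bash
    sbcast a.out /tmp/random.a.out
    srun /tmp/random.a.out

Since this is a build/launch configuration fragment, the exact commands will vary with the cluster's install locations.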
