For Slurm, after the "make install", did you do a "make
install-contrib" (which builds the pmi2 support)? I think you would
have seen a runtime error if you hadn't, but possibly not.
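
A quick way to check whether the pmi2 plugin is visible to srun is to list the MPI plugin types it knows about. This is only a sketch: it assumes srun is on the PATH, the helper name has_pmi2 is made up for illustration, and it simply looks for the string pmi2 in the output instead of relying on an exact output format, which may differ between Slurm versions.

```shell
#!/bin/bash
# has_pmi2: hypothetical helper; succeeds if its stdin mentions the pmi2 plugin.
has_pmi2() {
    grep -qw 'pmi2'
}

# "srun --mpi=list" prints the MPI plugin types of this Slurm installation.
# The list may be written to stderr, so both streams are captured.
if command -v srun >/dev/null 2>&1; then
    if srun --mpi=list 2>&1 | has_pmi2; then
        echo "pmi2 plugin found"
    else
        echo "pmi2 plugin missing: try 'make install-contrib' in the Slurm build tree"
    fi
fi
```

If the plugin is reported missing, that would point to the install-contrib step rather than to Open MPI.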
On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
Re: [slurm-dev] Re: MPI/OpenMPI send receive not working
First of all, thank you for the response.
Here are the answers:
[*]I tried multiple commands:
[*]I started with "srun -N2 --mpi=pmi2 ptest" then I
changed the slurm.conf's mpi parameter to pmi2 so I no
longer need the option.
[*]I also tried a script submitted via sbatch. It doesn't
work either: squeue shows the job as still running, even
though my program just passes a number from node 1 to
node 2 and shouldn't take that long.
[*]OpenMPI version is 1.10.2 / SLURM's is 15.08.8
[*]I built Slurm myself with no specific options. For
OpenMPI, I installed it from the CentOS 7 default
repo. I had previously tried building the same version myself
with the --with-slurm and --with-pmi options, but that wasn't
working either.
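
To confirm that the MpiDefault change in slurm.conf was actually picked up, the running configuration can be queried with scontrol show config. A minimal sketch, assuming scontrol is on the PATH; the helper name mpi_default is hypothetical.

```shell
#!/bin/bash
# mpi_default: hypothetical helper; reads "Key = Value" lines on stdin
# and prints the value of the MpiDefault key.
mpi_default() {
    awk -F'=' '$1 ~ /^[[:space:]]*MpiDefault[[:space:]]*$/ {
        gsub(/^[[:space:]]+|[[:space:]]+$/, "", $2)
        print $2
    }'
}

if command -v scontrol >/dev/null 2>&1; then
    echo "MpiDefault in the running config: $(scontrol show config | mpi_default)"
fi
```

If this does not print pmi2, the daemons have probably not reread slurm.conf yet (scontrol reconfigure or a restart may be needed).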
I am attaching a copy of my slurm.conf file and the script
I used to submit the job.
The script:
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res_mpi.txt
#
#SBATCH -N 2
module load openmpi
mpirun test
slurm.conf file:
#
# slurm.conf file generated by configurator easy.html.
#
# Put this file on all nodes of your cluster.
#
# See the slurm.conf man page for more information.
#
ControlMachine=m
ControlAddr=m
BackupController=mb
BackupAddr=mb
#
#MailProg=/bin/mail
MpiDefault=pmi2
MpiParams=ports=12000-12999
ProctrackType=proctrack/linuxproc
ReturnToService=2
#SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
#SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/slurmd
SlurmUser=slurm
#SlurmdUser=root
#StateSaveLocation=/var/spool/slurm
StateSaveLocation=/mnt/data/spool/slurm
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
PreemptType=preempt/partition_prio
PreemptMode=requeue
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
JobCompType=jobcomp/none
#SlurmctldDebug=3
#SlurmctldLogFile=/var/log/slurmctld.log
SlurmctldLogFile=/mnt/data/log/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
AccountingStorageBackupHost=mb
#
#
# COMPUTE NODES
NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
2016-04-30 16:40 GMT+01:00 Andy Riebs:
Hi,
The one problem that I see in your description is minor,
and probably not significant: the MPI ports parameter was
needed for very old versions of Open MPI, IIRC.
To help debug your problems, please respond to this list
with
[*]What command did you use to invoke your program?
[*]What versions of Slurm and OpenMPI are you using?
[*]Did you build them yourself, or use prebuilt
versions?
[*]If you built them yourself, what configuration
options did you use?
[*]If pre-built versions, where did you get them?
[*]A copy of your slurm.conf file (you may want to
change node names and other potentially sensitive
information)
Andy
On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
Hello everyone,
I've set up a basic configuration using Slurm with a master
node, a backup node, a login node and eight compute nodes.
Everything in Slurm is working fine. I can submit jobs and
see the state of the eight nodes as idle. The problem is
with OpenMPI. The parallel hello program, where each process
prints its rank among the global set, works, but when I try
to establish communication between nodes through MPI_Send
and MPI_Recv, it just hangs indefinitely.
I'm using CentOS 7; firewalld and SELinux are disabled. If I
launch my parallel program, ptest, on 2 nodes [n1, n2], a
quick check with lsof -i shows that ptest is listening on
port 1024 on both nodes, which I find weird since only one
should be listening. Moreover, I've set the Slurm MPI
parameters to pmi2 and the allowed ports to [12000-12999],
so why is it still using port 1024?
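
The lsof check described above can be narrowed down to listeners outside the configured range. This is a rough sketch, not a diagnosis: the helper name ports_outside_range is hypothetical, the 12000-12999 bounds are taken from the MpiParams setting in this thread, and it assumes lsof output where the last two columns look like "*:1024 (LISTEN)".

```shell
#!/bin/bash
# ports_outside_range LO HI: hypothetical helper; reads lsof-style lines on
# stdin and prints each listening port that falls outside LO..HI.
ports_outside_range() {
    awk -v lo="$1" -v hi="$2" '
        $NF == "(LISTEN)" {
            n = split($(NF-1), parts, ":")   # NAME column looks like *:1024
            port = parts[n] + 0
            if (port < lo + 0 || port > hi + 0) print port
        }'
}

# -P keeps port numbers numeric, -n skips hostname lookups.
if command -v lsof >/dev/null 2>&1; then
    lsof -i -P -n | ports_outside_range 12000 12999
fi
```

Run on a compute node while ptest is hanging, this reports every listener on the node outside the range, not only ptest, so the output still has to be matched against the ptest process by hand.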
I hope you can help me with this problem. I can't see
what's wrong.
Thank you in advance.
M. Acheli.