First of all, thank you for the reply.
Here are the answers:
1. I tried multiple commands:
   1. I started with "srun -N2 --mpi=pmi2 ptest", then I set MpiDefault to
      pmi2 in slurm.conf so that I no longer need the option.
   2. I also tried a script submitted via sbatch. It doesn't work either:
      squeue shows the job as still running. My program just passes a number
      from node 1 to node 2, so it shouldn't normally take that long.
2. OpenMPI is version 1.10.2; Slurm is 15.08.8.
3. I built Slurm myself with no specific options. I installed OpenMPI from
the CentOS 7 default repo. Before that, I had tried building the same
version myself with the --with-slurm and --with-pmi options, but it wasn't
working either.
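
Since only the point-to-point exchange hangs, one check that can be made independently of MPI is whether two nodes can complete a plain TCP round trip. The following is a minimal sketch of such a check, not the ptest source. The function names, the newline-terminated integer protocol and the use of an OS-assigned port are all assumptions made for illustration.

```python
import socket
import threading


def serve_once(host, port, ready, info):
    """Accept one connection, read an integer line, reply with integer + 1."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        info["port"] = srv.getsockname()[1]  # real port, useful when port=0
        ready.set()
        conn, _ = srv.accept()
        with conn, conn.makefile("rw") as f:
            value = int(f.readline())
            f.write(f"{value + 1}\n")
            f.flush()


def send_number(host, port, value, timeout=5.0):
    """Send an integer to host:port and return the integer that comes back."""
    with socket.create_connection((host, port), timeout=timeout) as cli:
        with cli.makefile("rw") as f:
            f.write(f"{value}\n")
            f.flush()
            return int(f.readline())


def round_trip_local(value):
    """Run both ends on the loopback interface as a self-test of the sketch."""
    ready = threading.Event()
    info = {}
    server = threading.Thread(
        target=serve_once, args=("127.0.0.1", 0, ready, info)
    )
    server.start()
    if not ready.wait(5.0):
        raise RuntimeError("server did not start listening")
    reply = send_number("127.0.0.1", info["port"], value)
    server.join()
    return reply


if __name__ == "__main__":
    print(round_trip_local(41))
```

Between two nodes, one would run serve_once on n1 (bound to "0.0.0.0" and a fixed port of one's choosing) and call send_number from n2 with n1's address. If that also hangs or times out, the problem is probably in the network setup rather than in Slurm or OpenMPI.
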
I am attaching a copy of my slurm.conf file and the script I used to submit
the job.
The script:
> #!/bin/bash
> #
> #SBATCH --job-name=test
> #SBATCH --output=res_mpi.txt
> #
> #SBATCH -N 2
> module load openmpi
> mpirun test
slurm.conf file:
> # slurm.conf file generated by configurator easy.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> ControlMachine=m
> ControlAddr=m
> BackupController=mb
> BackupAddr=mb
> #
> #MailProg=/bin/mail
> MpiDefault=pmi2
> MpiParams=ports=12000-12999
> ProctrackType=proctrack/linuxproc
> ReturnToService=2
> #SlurmctldPidFile=/var/run/slurmctld.pid
> #SlurmctldPort=6817
> #SlurmdPidFile=/var/run/slurmd.pid
> #SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurm/slurmd
> SlurmUser=slurm
> #SlurmdUser=root
> #StateSaveLocation=/var/spool/slurm
> StateSaveLocation=/mnt/data/spool/slurm
> SwitchType=switch/none
> TaskPlugin=task/none
> #
> #
> # TIMERS
> #KillWait=30
> #MinJobAge=300
> #SlurmctldTimeout=120
> #SlurmdTimeout=300
> #
> #
> # SCHEDULING
> FastSchedule=1
> SchedulerType=sched/backfill
> #SchedulerPort=7321
> SelectType=select/linear
> PreemptType=preempt/partition_prio
> PreemptMode=requeue
> #
> #
> # LOGGING AND ACCOUNTING
> AccountingStorageType=accounting_storage/slurmdbd
> #JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
> JobCompType=jobcomp/none
> #SlurmctldDebug=3
> #SlurmctldLogFile=/var/log/slurmctld.log
> SlurmctldLogFile=/mnt/data/log/slurmctld.log
> #SlurmdDebug=3
> SlurmdLogFile=/var/log/slurmd.log
> AccountingStorageBackupHost=mb
> #
> #
> # COMPUTE NODES
> NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
> NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN

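To double-check which MPI-related settings a slurm.conf like the one above actually sets, the file can be read as plain Key=Value lines. This is a rough sketch only: parse_slurm_conf and SAMPLE are hypothetical names, the sample is a short excerpt of the file above, and the parser ignores details such as case-insensitive keys and Include directives.

```python
def parse_slurm_conf(text):
    """Split slurm.conf text into global settings and node/partition lines.

    Comments (everything after '#') and blank lines are skipped. Lines that
    start with NodeName or PartitionName hold several Key=Value pairs, so
    they are returned unparsed in a separate list.
    """
    settings = {}
    multi = []
    for raw in text.splitlines():
        # Drop a leading mail quote marker, then any trailing comment.
        line = raw.lstrip("> ").split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip().lower() in ("nodename", "partitionname"):
            multi.append(line)
        else:
            settings[key.strip()] = value.strip()
    return settings, multi


SAMPLE = """\
# slurm.conf excerpt
MpiDefault=pmi2
MpiParams=ports=12000-12999
#SlurmctldPort=6817
NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
"""

settings, nodes = parse_slurm_conf(SAMPLE)
print(settings["MpiDefault"], settings["MpiParams"])
```

Only the first "=" on a line separates key from value, so MpiParams keeps its full value "ports=12000-12999".
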
2016-04-30 16:40 GMT+01:00 Andy Riebs <[email protected]>:
> Hi,
>
> The one problem that I see in your description is minor, and probably not
> significant: the MPI ports parameter was needed for very old versions of
> Open MPI, IIRC.
>
> To help debug your problems, please respond to this list with
>
> 1. What command did you use to invoke your program?
> 2. What versions of Slurm and OpenMPI are you using?
> 3. Did you build them yourself, or use prebuilt versions?
> - If you built them yourself, what configuration options did you use?
> - If pre-built versions, where did you get them?
> 4. A copy of your slurm.conf file (you may want to change node names
> and other potentially sensitive information)
>
> Andy
>
> On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
>
> Hello everyone,
>
> I've set up a basic configuration using Slurm with a master node, a backup
> node, a login node and eight compute nodes.
>
> Everything in Slurm is working fine. I can submit jobs and see the
> state of the eight nodes as idle. The problem is with OpenMPI. The hello
> parallel program, where each process prints its rank among the global set, is
> working, but when I try to establish communication between nodes through
> MPI_Send and MPI_Recv, it just hangs there indefinitely.
>
> I'm using CentOS 7; firewalld and SELinux are disabled. If I launch my
> parallel program, ptest, on 2 nodes [n1, n2], a quick check with lsof -i
> shows that ptest is listening on port 1024 on both nodes, which I find
> weird since only one should be listening. Moreover, I've set Slurm's MPI
> parameters to pmi2 and the allowed ports to [12000-12999], so why is it still
> using port 1024?
>
> I hope you can help me with this problem. I can't see what's wrong.
> Thank you in advance.
>
> M. Acheli.