Aha! I missed it the first time... In your script, replace "mpirun"
 with "srun" and the world should be better.
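
As a rough sketch, this is the batch script from further down the thread with only that one word changed. The file name `job.sh` is a hypothetical choice for illustration; the program name `test` and the `#SBATCH` lines are kept exactly as posted. The final check only validates shell syntax, so it runs even on a machine without Slurm.

```shell
#!/bin/bash
# Write the adjusted batch script to a file.
# job.sh is a hypothetical name, not one used in the thread.
cat > job.sh <<'EOF'
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res_mpi.txt
#
#SBATCH -N 2
module load openmpi
# Launch through Slurm instead of mpirun. With MpiDefault=pmi2 in
# slurm.conf, no explicit --mpi=pmi2 option should be needed here.
srun test
EOF

# Syntax check only; this does not contact Slurm.
bash -n job.sh && echo "job.sh: syntax OK"
```

On the cluster, the file would then be submitted with `sbatch job.sh`.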
 
 On 04/30/2016 01:35 PM, Mehdi Acheli
   wrote:
   Re: [slurm-dev] Re: MPI/OpenMPI send receive not working
   
   Uh, I did a "make all install", so I think PMI
     support is installed. And the hello world program is working;
     would it work if PMI support weren't installed?
   
     2016-04-30 18:04 GMT+01:00 Andy Riebs <[email protected]>:
       
          For Slurm, after the
           "make install", did you do a "make install-contrib" (which
           builds the pmi2 support)? I think you would have seen a
           runtime error if you hadn't, but possibly not.
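
One way to look at both sides of this, sketched under a few assumptions: `srun --mpi=list` reports the MPI plugin types Slurm can offer, `scontrol show config` shows the MPI settings in effect, and `ompi_info` lists the components Open MPI was built with. The `check` helper is a hypothetical wrapper added here so the snippet degrades gracefully when a tool is missing; the grep patterns are only a guess at what is worth looking for.

```shell
#!/bin/bash
# Usage: check PATTERN TOOL [ARGS...]
# Runs TOOL with ARGS and prints the output lines matching PATTERN
# (case-insensitive). Reports clearly if TOOL is not installed.
check() {
    local pattern=$1 tool=$2
    shift
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: not found in PATH"
        return 0
    fi
    "$@" 2>&1 | grep -i -E "$pattern" || echo "$tool: no lines matching '$pattern'"
}

# MPI plugin types this Slurm installation offers; look for pmi2.
check 'pmi' srun --mpi=list

# MPI settings the controller is using (MpiDefault, MpiParams).
check '^Mpi' scontrol show config

# Open MPI components built with PMI or Slurm support.
check 'pmi|slurm' ompi_info
```

If pmi2 is missing from the first list, or Open MPI shows no PMI components, that would point at a build problem rather than at the job script.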
             
             On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
              First of all, thank you
                  for the response.
                  Here are the answers:
                     [*]I tried multiple commands:
                       [*]I started with "srun -N2 --mpi=pmi2 ptest"
                         then I changed the slurm.conf's mpi
                         parameter to pmi2 so I no longer need the
                         option.
                          [*]I also tried a script submitted via
                          sbatch. It doesn't work either: squeue
                          shows the job as still running, even though my
                          program just passes a number from node 1 to
                          node 2, so it shouldn't take that long.
                     [*]OpenMPI version is 1.10.2 / SLURM's is
                       15.08.8
                   [*]I built Slurm myself with no specific options.
                     For OpenMPI I actually downloaded it from the
                     CentOS 7 default repo. But I tried building the
                     same version before with --with-slurm and
                      --with-pmi options, yet it wasn't working
                      either.
                    I am attaching a copy of my slurm.conf file and
                      the script I used to submit the job.
                The script:
                          #!/bin/bash
                          #
                          #SBATCH --job-name=test
                          #SBATCH --output=res_mpi.txt
                          #
                          #SBATCH -N 2
                          module load openmpi
                          mpirun test
                    slurm.conf file:
                      # slurm.conf file generated by configurator easy.html.
                      # Put this file on all nodes of your cluster.
                      # See the slurm.conf man page for more information.
                     #
                     ControlMachine=m
                     ControlAddr=m
                     BackupController=mb
                     BackupAddr=mb
                     #
                     #MailProg=/bin/mail
                     MpiDefault=pmi2
                     MpiParams=ports=12000-12999
                     ProctrackType=proctrack/linuxproc
                     ReturnToService=2
                     #SlurmctldPidFile=/var/run/slurmctld.pid
                     #SlurmctldPort=6817
                     #SlurmdPidFile=/var/run/slurmd.pid
                     #SlurmdPort=6818
                     SlurmdSpoolDir=/var/spool/slurm/slurmd
                     SlurmUser=slurm
                     #SlurmdUser=root
                     #StateSaveLocation=/var/spool/slurm
                     StateSaveLocation=/mnt/data/spool/slurm
                     SwitchType=switch/none
                     TaskPlugin=task/none
                     #
                     #
                      # TIMERS
                     #KillWait=30
                     #MinJobAge=300
                     #SlurmctldTimeout=120
                     #SlurmdTimeout=300
                     #
                     #
                      # SCHEDULING
                     FastSchedule=1
                     SchedulerType=sched/backfill
                     #SchedulerPort=7321
                     SelectType=select/linear
                     PreemptType=preempt/partition_prio
                     PreemptMode=requeue
                     #
                     #
                      # LOGGING AND ACCOUNTING
                     AccountingStorageType=accounting_storage/slurmdbd
                     #JobAcctGatherFrequency=30
                     JobAcctGatherType=jobacct_gather/linux
                     JobCompType=jobcomp/none
                     #SlurmctldDebug=3
                     #SlurmctldLogFile=/var/log/slurmctld.log
                     SlurmctldLogFile=/mnt/data/log/slurmctld.log
                     #SlurmdDebug=3
                     SlurmdLogFile=/var/log/slurmd.log
                     AccountingStorageBackupHost=mb
                     #
                     #
                      # COMPUTE NODES
                      NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
                      NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
                   2016-04-30 16:40 GMT+01:00 Andy
                     Riebs <[email protected]>:
                        Hi,
                         
                         The one problem that I see in your
                         description is minor, and probably not
                         significant: the MPI ports parameter was
                         needed for very old versions of Open MPI,
                         IIRC.
                         
                         To help debug your problems, please respond
                         to this list with
                         
                           [*]What command did you use to invoke
                             your program?
                             [*]What versions of Slurm and OpenMPI are
                             you using?
                           [*]Did you build them yourself, or use
                             prebuilt versions?
                             
                               [*]If you built them yourself, what
                                 configuration options did you use?
                                 [*]If pre-built versions, where did
                                 you get them?
                           [*]A copy of your slurm.conf file (you
                             may want to change node names and other
                             potentially sensitive information)
                         Andy
                           On 04/30/2016 10:02 AM, Mehdi Acheli
                             wrote:
                       Hello everyone,
                          I've set up a basic Slurm configuration with
                              a master node, a backup node, a login node,
                              and eight compute nodes.
                          Everything in Slurm is
                              working fine. I can submit jobs and see
                              the state of the eight nodes as Idle.
                              The problem is with OpenMPI. The parallel
                              hello program, where each process
                              prints its rank among the global set, is
                              working, but when I try to establish
                              communication between nodes through
                              MPI_Send and MPI_Recv, it just hangs
                              there indefinitely.
                              I'm using CentOS 7; firewalld and
                                  SELinux are disabled. If I launch my
                                  parallel program, ptest, on 2 nodes
                                  [n1, n2], a quick check with lsof
                                  -i shows that ptest is listening on
                                  port 1024 on both nodes, which I
                                  find weird since only one should be
                                  listening. Moreover, I've set Slurm's
                                  MPI parameters to pmi2 with allowed
                                  ports [12000-12999], so why is
                                  it still using port 1024?
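
For reference, the port check described above can be narrowed to a single program. This is only a sketch: `ports_for` is a hypothetical helper name, and it assumes lsof is installed on the compute nodes. It shows which sockets the process holds; it does not by itself explain why a given port was chosen.

```shell
#!/bin/bash
# List network sockets held by processes whose command name starts with $1.
# -n and -P keep host names and port numbers numeric;
# -a ANDs the -i (network files) and -c (command name) filters.
ports_for() {
    lsof -nP -i -a -c "$1" 2>/dev/null || echo "no network sockets found for $1"
}

# ptest is the program name from this message; run this on each allocated
# node while the job is hanging.
ports_for ptest
```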
                          I hope you
                              can help me with this problem. I can't
                              see what's wrong.
                          
                             Thank
                                 you in advance.
                             
                                 M. Acheli.
