For Slurm, after the "make install", did you do a "make
 install-contrib" (which builds the pmi2 support)? I think you would
 have seen a runtime error if you hadn't, but possibly not.
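
A quick way to see which MPI plugin types srun itself reports is `srun --mpi=list`. The sketch below wraps that in a small check. The helper name `has_pmi2` is made up for illustration, and the check only tells you what srun lists, not whether every PMI2 library landed in your install prefix.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the plugin listing read from stdin
# mentions pmi2 as a whole word (e.g. "pmi2" or "mpi/pmi2").
has_pmi2() {
    grep -qw 'pmi2'
}

# Only query Slurm when srun is actually available on this machine.
if command -v srun >/dev/null 2>&1; then
    # Capture stderr too, since the listing may be written there.
    if srun --mpi=list 2>&1 | has_pmi2; then
        echo "srun lists pmi2"
    else
        echo "srun does not list pmi2"
    fi
fi
```

If srun does not list pmi2, `--mpi=pmi2` and `MpiDefault=pmi2` cannot work regardless of how Open MPI was built.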
 
 On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
   Re: [slurm-dev] Re: MPI/OpenMPI send receive not working
   
   First of all, thank you for the reply.
     Here are the answers:
         [*]I tried multiple commands:
           [*]I started with "srun -N2 --mpi=pmi2 ptest", then I
             changed slurm.conf's MPI parameter to pmi2 so I no
             longer need the option.
           [*]I also tried a script submitted via sbatch. It doesn't
             work either, and squeue shows that it's still running. My
             program just passes a number from node 1 to node 2,
             so it shouldn't normally take that long.
         [*]OpenMPI version is 1.10.2; Slurm's is 15.08.8.
         [*]I built Slurm myself with no specific options. For
           OpenMPI I actually downloaded it from the CentOS 7 default
           repo. But I tried building the same version before with the
           --with-slurm and --with-pmi options, yet it wasn't working
           either.
       I am attaching a copy of my slurm.conf file and the script
         I used to submit the job.
     The script:
       #!/bin/bash
       #
       #SBATCH --job-name=test
       #SBATCH --output=res_mpi.txt
       #
       #SBATCH -N 2
       module load openmpi
       mpirun test
     Slurm.conf file:
       #
       # slurm.conf file generated by configurator easy.html.
       # Put this file on all nodes of your cluster.
       # See the slurm.conf man page for more information.
       #
       ControlMachine=m
       ControlAddr=m
       BackupController=mb
       BackupAddr=mb
       #
       #MailProg=/bin/mail
       MpiDefault=pmi2
       MpiParams=ports=12000-12999
       ProctrackType=proctrack/linuxproc
       ReturnToService=2
       #SlurmctldPidFile=/var/run/slurmctld.pid
       #SlurmctldPort=6817
       #SlurmdPidFile=/var/run/slurmd.pid
       #SlurmdPort=6818
       SlurmdSpoolDir=/var/spool/slurm/slurmd
       SlurmUser=slurm
       #SlurmdUser=root
       #StateSaveLocation=/var/spool/slurm
       StateSaveLocation=/mnt/data/spool/slurm
       SwitchType=switch/none
       TaskPlugin=task/none
       #
       #
       # TIMERS
       #KillWait=30
       #MinJobAge=300
       #SlurmctldTimeout=120
       #SlurmdTimeout=300
       #
       #
       # SCHEDULING
       FastSchedule=1
       SchedulerType=sched/backfill
       #SchedulerPort=7321
       SelectType=select/linear
       PreemptType=preempt/partition_prio
       PreemptMode=requeue
       #
       #
       # LOGGING AND ACCOUNTING
       AccountingStorageType=accounting_storage/slurmdbd
       #JobAcctGatherFrequency=30
       JobAcctGatherType=jobacct_gather/linux
       JobCompType=jobcomp/none
       #SlurmctldDebug=3
       #SlurmctldLogFile=/var/log/slurmctld.log
       SlurmctldLogFile=/mnt/data/log/slurmctld.log
       #SlurmdDebug=3
       SlurmdLogFile=/var/log/slurmd.log
       AccountingStorageBackupHost=mb
       #
       #
       # COMPUTE NODES
       NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
       NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
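
With a config this long it is easy to lose track of which MPI settings are actually active. As a rough sketch, the function below prints only the uncommented `Mpi*` lines from a slurm.conf-style file. The name `mpi_settings` and the temporary sample file are assumptions for illustration; the sample lines are copied from the config above.

```shell
#!/bin/bash
# Hypothetical helper: print the active (uncommented) Mpi* settings
# from a slurm.conf-style file. Keys are matched case-insensitively.
# Usage: mpi_settings /path/to/slurm.conf
mpi_settings() {
    grep -iE '^[[:space:]]*Mpi[A-Za-z]*=' "$1"
}

# Demo on a few lines taken from the config in this message.
conf=$(mktemp)
printf '%s\n' \
    '#MailProg=/bin/mail' \
    'MpiDefault=pmi2' \
    'MpiParams=ports=12000-12999' \
    'SwitchType=switch/none' > "$conf"
mpi_settings "$conf"   # prints the MpiDefault and MpiParams lines
rm -f "$conf"
```

On a live cluster, `scontrol show config | grep -i '^Mpi'` should show the values the running daemons actually loaded, which is worth comparing against the file.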
     2016-04-30 16:40 GMT+01:00 Andy Riebs <[email protected]>:
       
          Hi,
           
           The one problem that I see in your description is minor,
           and probably not significant: the MPI ports parameter was
           needed for very old versions of Open MPI, IIRC.
           
           To help debug your problems, please respond to this list
           with
           
             [*]What command did you use to invoke your program?
             [*]What versions of Slurm and OpenMPI are you using?
             [*]Did you build them yourself, or use prebuilt
               versions?
                 [*]If you built them yourself, what configuration
                   options did you use?
                 [*]If pre-built versions, where did you get them?
             [*]A copy of your slurm.conf file (you may want to
               change node names and other potentially sensitive
               information)
           Andy
             On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
             Hello everyone,
               I've set up a basic configuration using Slurm with a
               master node, a backup node, a login node and eight
               compute nodes. Everything in Slurm is working fine. I
               can submit jobs and see the state of the eight
               nodes as idle. The problem is with OpenMPI. The
               hello parallel program, where each process prints
               its rank among the global set, works, but when
               I try to establish communication between nodes
               through MPI_Send and MPI_Recv, it just hangs
               indefinitely.
               I'm using CentOS 7; firewalld and SELinux are
               disabled. If I launch my parallel program, ptest,
               on 2 nodes [n1, n2], a quick check with lsof -i shows
               that ptest is listening on port 1024 on both nodes,
               which I find weird since only one should be
               listening. Moreover, I've set Slurm's MPI
               parameter to pmi2 and the allowed ports to
               [12000-12999], so why is it still using port
               1024?
               I hope you can help me with this problem. I can't
               see what's wrong.

               Thank you in advance.

               M. Acheli.
