Uh, I ran "make all install", so I think PMI support is installed. And
the hello world program is working; would it work if PMI weren't installed?
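
One way to check rather than guess: "srun --mpi=list" prints the MPI plugin types that the Slurm installation can load. The helper below is only a sketch. The name has_mpi_plugin is made up for illustration (it is not part of Slurm), and it just looks for the plugin name as a whole word in the text it is given, since the exact listing format may differ between Slurm versions.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): succeeds if the plugin name given
# as $1 appears as a whole word in the text read from standard input.
has_mpi_plugin() {
    grep -qw -- "$1"
}

# Intended use on a node where Slurm is installed (stderr is merged in case
# the listing is printed there):
#   srun --mpi=list 2>&1 | has_mpi_plugin pmi2 && echo "pmi2 plugin found"
```

On the Open MPI side, "ompi_info | grep -i pmi" should show whether that particular build includes any PMI components; a build without them may not be able to use Slurm's PMI2 wire-up at all.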

2016-04-30 18:04 GMT+01:00 Andy Riebs <[email protected]>:

> For Slurm, after the "make install", did you do a "make install-contrib"
> (which builds the pmi2 support)? I think you would have seen a runtime
> error if you hadn't, but possibly not.
>
> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>
> First of all, thank you for the reply.
>
> Here are the answers:
>
>    1. I tried multiple commands:
>       1. I started with "srun -N2 --mpi=pmi2 ptest", then I changed the
>       MpiDefault parameter in slurm.conf to pmi2 so I no longer need the option.
>       2. I also tried a script submitted via sbatch. It doesn't work
>       either: squeue shows the job as still running, although my program just
>       passes a number from node 1 to node 2 and shouldn't take that long.
>    2. OpenMPI version is 1.10.2; Slurm's is 15.08.8.
>    3. I built Slurm myself with no specific options. For OpenMPI I
>    actually installed it from the CentOS 7 default repo. But I tried building
>    the same version before with the --with-slurm and --with-pmi options, yet it
>    wasn't working either.
>
> I am attaching a copy of my slurm.conf file and the script I used to submit
> the job.
>
> The script:
>
>> #!/bin/bash
>> #
>> #SBATCH --job-name=test
>> #SBATCH --output=res_mpi.txt
>> #
>> #SBATCH -N 2
>> module load openmpi
>> mpirun test
>
>
> slurm.conf file:
>
>
>> # slurm.conf file generated by configurator easy.html.
>> # Put this file on all nodes of your cluster.
>> # See the slurm.conf man page for more information.
>> #
>> ControlMachine=m
>> ControlAddr=m
>> BackupController=mb
>> BackupAddr=mb
>> #
>> #MailProg=/bin/mail
>> MpiDefault=pmi2
>> MpiParams=ports=12000-12999
>> ProctrackType=proctrack/linuxproc
>> ReturnToService=2
>> #SlurmctldPidFile=/var/run/slurmctld.pid
>> #SlurmctldPort=6817
>> #SlurmdPidFile=/var/run/slurmd.pid
>> #SlurmdPort=6818
>> SlurmdSpoolDir=/var/spool/slurm/slurmd
>> SlurmUser=slurm
>> #SlurmdUser=root
>> #StateSaveLocation=/var/spool/slurm
>> StateSaveLocation=/mnt/data/spool/slurm
>> SwitchType=switch/none
>> TaskPlugin=task/none
>> #
>> #
>> # TIMERS
>> #KillWait=30
>> #MinJobAge=300
>> #SlurmctldTimeout=120
>> #SlurmdTimeout=300
>> #
>> #
>> # SCHEDULING
>> FastSchedule=1
>> SchedulerType=sched/backfill
>> #SchedulerPort=7321
>> SelectType=select/linear
>> PreemptType=preempt/partition_prio
>> PreemptMode=requeue
>> #
>> #
>> # LOGGING AND ACCOUNTING
>> AccountingStorageType=accounting_storage/slurmdbd
>> #JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/linux
>> JobCompType=jobcomp/none
>> #SlurmctldDebug=3
>> #SlurmctldLogFile=/var/log/slurmctld.log
>> SlurmctldLogFile=/mnt/data/log/slurmctld.log
>> #SlurmdDebug=3
>> SlurmdLogFile=/var/log/slurmd.log
>> AccountingStorageBackupHost=mb
>> #
>> #
>> # COMPUTE NODES
>> NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
>> NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
>>
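
To confirm which MPI settings are actually active in a file like the one above, the uncommented Mpi* lines can be pulled out directly. This is a small sketch: mpi_settings is a hypothetical helper name, and the path passed to it is wherever slurm.conf lives on your system.

```shell
#!/bin/bash
# Hypothetical helper: print the uncommented Mpi* settings (for example
# MpiDefault and MpiParams) from the slurm.conf file given as $1.
# The match is case-insensitive; lines starting with '#' are skipped
# because the pattern requires the key at the start of the line.
mpi_settings() {
    grep -Ei '^[[:space:]]*mpi[a-z]*=' -- "$1"
}

# Intended use:
#   mpi_settings /path/to/slurm.conf
```

Note that the daemons only see an edited slurm.conf after a restart or "scontrol reconfigure"; "scontrol show config" reports the values the running controller is using, which is worth comparing against the file.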
>>
> 2016-04-30 16:40 GMT+01:00 Andy Riebs <[email protected]>:
>
>> Hi,
>>
>> The one problem that I see in your description is minor, and probably not
>> significant: the MPI ports parameter was needed for very old versions of
>> Open MPI, IIRC.
>>
>> To help debug your problems, please respond to this list with
>>
>>    1. What command did you use to invoke your program?
>>    2. What versions of Slurm and OpenMPI are you using?
>>    3. Did you build them yourself, or use prebuilt versions?
>>       - If you built them yourself, what configuration options did you use?
>>       - If pre-built versions, where did you get them?
>>    4. A copy of your slurm.conf file (you may want to change node names
>>    and other potentially sensitive information)
>>
>> Andy
>>
>> On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
>>
>> Hello everyone,
>>
>> I've set up a basic configuration using Slurm with a master node, a
>> backup node, a login node and eight compute nodes.
>>
>> Everything in Slurm is working fine. I can submit jobs and see
>> the state of the eight nodes as Idle. The problem is with OpenMPI. The
>> parallel hello program, where each process prints its rank among the global
>> set, works, but when I try to establish communication between nodes
>> through MPI_Send and MPI_Recv, it just hangs indefinitely.
>>
>> I'm using CentOS 7; firewalld and SELinux are disabled. If I launch my
>> parallel program, ptest, on 2 nodes [n1, n2], a quick check with lsof -i
>> shows that ptest is listening on port 1024 on both nodes, which I find
>> weird since only one should be listening. Moreover, I've set Slurm's MPI
>> parameter to pmi2 and the allowed ports to [12000-12999], so why is it
>> still using port 1024?
>>
>> I hope you can help me with this problem. I can't see what's wrong.
>> Thank you in advance.
>>
>> M. Acheli.
>>
>>
>>
>
>
