I just did. Permit me to include a capture of the script output file:

[image: Inline image 1]

My script specifies the option "-N 2", but it looks like the world size
is only one process, and both nodes are trying to execute
MPI_Send!
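
To narrow this down, here is a minimal sketch of a batch script that first
asks Slurm alone (no MPI involved) how many tasks it starts, and only then
launches the program through the PMI2 plugin. The helper name report_task,
the --ntasks-per-node=1 setting and the ./ptest path are assumptions made
for illustration; SLURM_PROCID and SLURM_NTASKS are environment variables
that srun sets for each task it launches.

```shell
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=res_mpi.txt
#SBATCH -N 2
#SBATCH --ntasks-per-node=1   # assumption: one MPI rank per node

# Hypothetical helper: print the task layout that Slurm hands to a task.
# Falls back to "unset" when run outside of srun.
report_task() {
    echo "task ${SLURM_PROCID:-unset} of ${SLURM_NTASKS:-unset}"
}

# Only attempt the launch on a machine where srun is installed.
if command -v srun >/dev/null 2>&1; then
    # Step 1: no MPI involved, just show how many tasks Slurm starts.
    # declare -f passes the function definition to the remote shell.
    srun bash -c "$(declare -f report_task); report_task"

    # Step 2: launch the MPI program through the PMI2 plugin.
    # ./ptest is an assumed path to the test program from this thread.
    srun --mpi=pmi2 ./ptest
fi
```

If step 1 reports two distinct tasks while the MPI program still sees a
world size of 1, the job layout is probably fine and the MPI library is
likely not picking up its rank from Slurm's PMI2 interface.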

2016-04-30 18:41 GMT+01:00 Andy Riebs <[email protected]>:

> Aha! I missed it the first time... In your script, replace "mpirun" with
> "srun" and the world should be better.
>
> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>
> Uh, I did a "make all install", so I think PMI support is installed. And
> the hello world program is working; would it work if PMI weren't installed?
>
> 2016-04-30 18:04 GMT+01:00 Andy Riebs <[email protected]>:
>
>> For Slurm, after the "make install", did you do a "make install-contrib"
>> (which builds the pmi2 support)? I think you would have seen a runtime
>> error if you hadn't, but possibly not.
>>
>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>
>> First of all, thank you for the reply.
>>
>> Here are the answers:
>>
>>    1. I tried multiple commands:
>>       1. I started with "srun -N2 --mpi=pmi2 ptest" then I changed the
>>       slurm.conf's mpi parameter to pmi2 so I no longer need the option.
>>       2. I also tried a script submitted via sbatch. It doesn't work
>>       either and squeue shows that it's running. My program is just passing a
>>       number from node 1 to node 2 so it doesn't normally take that long.
>>    2. OpenMPI version is 1.10.2 / SLURM's is 15.08.8
>>    3. I built Slurm myself with no specific options. For OpenMPI I
>>    actually downloaded it from the CentOS 7 default repo. But I tried
>>    building the same version before with the --with-slurm and --with-pmi
>>    options, yet it wasn't working either.
>>
>> I am attaching a copy of my slurm.conf file and the script I used to
>> submit the job.
>>
>> The script:
>>
>>> #!/bin/bash
>>> #
>>> #SBATCH --job-name=test
>>> #SBATCH --output=res_mpi.txt
>>> #
>>> #SBATCH -N 2
>>> module load openmpi
>>> mpirun test
>>
>>
>> slurm.conf file:
>>
>>> # slurm.conf file generated by configurator easy.html.
>>> # Put this file on all nodes of your cluster.
>>> # See the slurm.conf man page for more information.
>>> #
>>> ControlMachine=m
>>> ControlAddr=m
>>> BackupController=mb
>>> BackupAddr=mb
>>> #
>>> #MailProg=/bin/mail
>>> MpiDefault=pmi2
>>> MpiParams=ports=12000-12999
>>> ProctrackType=proctrack/linuxproc
>>> ReturnToService=2
>>> #SlurmctldPidFile=/var/run/slurmctld.pid
>>> #SlurmctldPort=6817
>>> #SlurmdPidFile=/var/run/slurmd.pid
>>> #SlurmdPort=6818
>>> SlurmdSpoolDir=/var/spool/slurm/slurmd
>>> SlurmUser=slurm
>>> #SlurmdUser=root
>>> #StateSaveLocation=/var/spool/slurm
>>> StateSaveLocation=/mnt/data/spool/slurm
>>> SwitchType=switch/none
>>> TaskPlugin=task/none
>>> #
>>> #
>>> # TIMERS
>>> #KillWait=30
>>> #MinJobAge=300
>>> #SlurmctldTimeout=120
>>> #SlurmdTimeout=300
>>> #
>>> #
>>> # SCHEDULING
>>> FastSchedule=1
>>> SchedulerType=sched/backfill
>>> #SchedulerPort=7321
>>> SelectType=select/linear
>>> PreemptType=preempt/partition_prio
>>> PreemptMode=requeue
>>> #
>>> #
>>> # LOGGING AND ACCOUNTING
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> #JobAcctGatherFrequency=30
>>> JobAcctGatherType=jobacct_gather/linux
>>> JobCompType=jobcomp/none
>>> #SlurmctldDebug=3
>>> #SlurmctldLogFile=/var/log/slurmctld.log
>>> SlurmctldLogFile=/mnt/data/log/slurmctld.log
>>> #SlurmdDebug=3
>>> SlurmdLogFile=/var/log/slurmd.log
>>> AccountingStorageBackupHost=mb
>>> #
>>> #
>>> # COMPUTE NODES
>>> NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
>>> NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
>>
>> 2016-04-30 16:40 GMT+01:00 Andy Riebs <[email protected]>:
>>
>>> Hi,
>>>
>>> The one problem that I see in your description is minor, and probably
>>> not significant: the MPI ports parameter was needed for very old versions
>>> of Open MPI, IIRC.
>>>
>>> To help debug your problems, please respond to this list with
>>>
>>>    1. What command did you use to invoke your program?
>>>    2. What versions of Slurm and OpenMPI are you using?
>>>    3. Did you build them yourself, or use prebuilt versions?
>>>       - If you built them yourself, what configuration options did you use?
>>>       - If pre-built versions, where did you get them?
>>>    4. A copy of your slurm.conf file (you may want to change node names
>>>    and other potentially sensitive information)
>>>
>>> Andy
>>>
>>> On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
>>>
>>> Hello everyone,
>>>
>>> I've set up a basic configuration using Slurm with a
>>> master node, a backup node, a login node and eight compute nodes.
>>>
>>> Everything in Slurm is working fine. I can submit
>>> jobs and see the state of the eight nodes as Idle. The problem is with
>>> OpenMPI. The parallel hello program, where each process prints its rank
>>> within the global set, works, but when I try to establish communication
>>> between nodes through MPI_Send and MPI_Recv, it just hangs there
>>> indefinitely.
>>>
>>> I'm using CentOS 7; firewalld and SELinux are disabled. If I launch my
>>> parallel program, ptest, on 2 nodes, [n1, n2], a quick check with lsof -i
>>> shows that ptest is listening on port 1024 on both nodes, which I find
>>> odd since only one should be listening. Moreover, I've set Slurm's MPI
>>> default to pmi2 and the allowed ports to [12000-12999], so why is it still
>>> using port 1024?
>>>
>>> I hope you can help me with this problem. I can't see what's
>>> wrong.
>>> Thank you in advance.
>>>
>>> M. Acheli.
>>>
>>>
>>>
>>
>>
>
>
