Yes, if I use "salloc -N2 sh" and then launch the job via mpirun, the hello
world program works correctly. However, my original program still hangs
at the send and receive calls.
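
One way to compare the two launch paths without involving MPI at all is to print the rank and size each launcher advertises in the environment. The following is only a minimal sketch: the function name `report_launch_env` is made up for illustration, and it assumes that srun exports `SLURM_PROCID`/`SLURM_NTASKS`, that Slurm's pmi2 plugin exports `PMI_RANK`/`PMI_SIZE`, and that mpirun exports `OMPI_COMM_WORLD_RANK`/`OMPI_COMM_WORLD_SIZE`.

```shell
#!/bin/bash
# Hypothetical helper: print the rank/size pairs that each launcher is
# assumed to export. Variables that are not set are reported as "unset".
report_launch_env() {
    echo "host=${HOSTNAME:-unknown}" \
         "slurm=${SLURM_PROCID:-unset}/${SLURM_NTASKS:-unset}" \
         "pmi=${PMI_RANK:-unset}/${PMI_SIZE:-unset}" \
         "ompi=${OMPI_COMM_WORLD_RANK:-unset}/${OMPI_COMM_WORLD_SIZE:-unset}"
}

report_launch_env
```

Saved as, say, report.sh (a hypothetical file name) and run once with "srun -N2 ./report.sh" and once with "mpirun ./report.sh" inside the allocation, this should show whether each task sees a PMI rank and size under srun. If the PMI values are there but the MPI program still reports a world size of one, the Open MPI build probably lacks PMI support, though that is an inference rather than something this check proves.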

2016-04-30 19:47 GMT+01:00 Ralph Castain <[email protected]>:

> Your slurm-OMPI integration is clearly broken - the processes do not
> realize they are operating in a common world. Does it work if you use
> mpirun instead of srun?
>
>
> On Apr 30, 2016, at 11:28 AM, Mehdi Acheli <[email protected]> wrote:
>
> No, I just tested another program and it seems that the world_size is
> reduced to one even though I launch the job on two nodes. The hello program
> behaves the same way. Well, I am completely lost now.
>
> <Capture.PNG><Capture1.PNG>
>
> 2016-04-30 19:09 GMT+01:00 Ralph Castain <[email protected]>:
>
>> This looks like a bug in your program - you specified an invalid rank
>> when attempting to send.
>>
>> On Apr 30, 2016, at 10:59 AM, Mehdi Acheli <[email protected]> wrote:
>>
>> I just did. Permit me to include a capture of the script output file:
>>
>> <Capture.PNG>
>>
>> I specify the option "-N 2" in my script, but it looks like the
>> world_size consists of only one process and both nodes are trying to
>> execute an MPI_Send!
>>
>> 2016-04-30 18:41 GMT+01:00 Andy Riebs <[email protected]>:
>>
>>> Aha! I missed it the first time... In your script, replace "mpirun" with
>>> "srun" and the world should be better.
>>>
>>> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>>>
>>> Um, I ran "make all install", so I think PMI support is installed. And
>>> the hello world program works; would it if PMI support weren't installed?
>>>
>>> 2016-04-30 18:04 GMT+01:00 Andy Riebs <[email protected]>:
>>>
>>>> For Slurm, after the "make install", did you do a "make
>>>> install-contrib" (which builds the pmi2 support)? I think you would have
>>>> seen a runtime error if you hadn't, but possibly not.
>>>>
>>>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>>>
>>>> First of all, thank you for the response.
>>>>
>>>> Here are the answers:
>>>>
>>>>    1. I tried multiple commands:
>>>>       1. I started with "srun -N2 --mpi=pmi2 ptest", then I changed the
>>>>       MPI parameter in slurm.conf to pmi2 so I no longer need the option.
>>>>       2. I also tried a script submitted via sbatch. It doesn't work
>>>>       either, and squeue shows that it is still running. My program just
>>>>       passes a number from node 1 to node 2, so it shouldn't normally
>>>>       take that long.
>>>>    2. The Open MPI version is 1.10.2; Slurm's is 15.08.8.
>>>>    3. I built Slurm myself with no specific options. For Open MPI, I
>>>>    installed it from the CentOS 7 default repo. I had previously tried
>>>>    building the same version with the --with-slurm and --with-pmi
>>>>    options, but that didn't work either.
>>>>
>>>> I am attaching a copy of my slurm.conf file and the script I used to
>>>> submit the job.
>>>>
>>>> The script:
>>>>
>>>>> #!/bin/bash
>>>>> #
>>>>> #SBATCH --job-name=test
>>>>> #SBATCH --output=res_mpi.txt
>>>>> #
>>>>> #SBATCH -N 2
>>>>> module load openmpi
>>>>> mpirun test
>>>>
>>>>
>>>> slurm.conf file:
>>>>
>>>>> # slurm.conf file generated by configurator easy.html.
>>>>> # Put this file on all nodes of your cluster.
>>>>> # See the slurm.conf man page for more information.
>>>>> #
>>>>> ControlMachine=m
>>>>> ControlAddr=m
>>>>> BackupController=mb
>>>>> BackupAddr=mb
>>>>> #
>>>>> #MailProg=/bin/mail
>>>>> MpiDefault=pmi2
>>>>> MpiParams=ports=12000-12999
>>>>> ProctrackType=proctrack/linuxproc
>>>>> ReturnToService=2
>>>>> #SlurmctldPidFile=/var/run/slurmctld.pid
>>>>> #SlurmctldPort=6817
>>>>> #SlurmdPidFile=/var/run/slurmd.pid
>>>>> #SlurmdPort=6818
>>>>> SlurmdSpoolDir=/var/spool/slurm/slurmd
>>>>> SlurmUser=slurm
>>>>> #SlurmdUser=root
>>>>> #StateSaveLocation=/var/spool/slurm
>>>>> StateSaveLocation=/mnt/data/spool/slurm
>>>>> SwitchType=switch/none
>>>>> TaskPlugin=task/none
>>>>> #
>>>>> #
>>>>> # TIMERS
>>>>> #KillWait=30
>>>>> #MinJobAge=300
>>>>> #SlurmctldTimeout=120
>>>>> #SlurmdTimeout=300
>>>>> #
>>>>> #
>>>>> # SCHEDULING
>>>>> FastSchedule=1
>>>>> SchedulerType=sched/backfill
>>>>> #SchedulerPort=7321
>>>>> SelectType=select/linear
>>>>> PreemptType=preempt/partition_prio
>>>>> PreemptMode=requeue
>>>>> #
>>>>> #
>>>>> # LOGGING AND ACCOUNTING
>>>>> AccountingStorageType=accounting_storage/slurmdbd
>>>>> #JobAcctGatherFrequency=30
>>>>> JobAcctGatherType=jobacct_gather/linux
>>>>> JobCompType=jobcomp/none
>>>>> #SlurmctldDebug=3
>>>>> #SlurmctldLogFile=/var/log/slurmctld.log
>>>>> SlurmctldLogFile=/mnt/data/log/slurmctld.log
>>>>> #SlurmdDebug=3
>>>>> SlurmdLogFile=/var/log/slurmd.log
>>>>> AccountingStorageBackupHost=mb
>>>>> #
>>>>> #
>>>>> # COMPUTE NODES
>>>>> NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
>>>>> NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
>>>>
>>>> 2016-04-30 16:40 GMT+01:00 Andy Riebs <[email protected]>:
>>>>
>>>>> Hi,
>>>>>
>>>>> The one problem that I see in your description is minor, and probably
>>>>> not significant: the MPI ports parameter was needed for very old versions
>>>>> of Open MPI, IIRC.
>>>>>
>>>>> To help debug your problems, please respond to this list with
>>>>>
>>>>>    1. What command did you use to invoke your program?
>>>>>    2. What versions of Slurm and OpenMPI are you using?
>>>>>    3. Did you build them yourself, or use prebuilt versions?
>>>>>       - If you built them yourself, what configuration options did you
>>>>>       use?
>>>>>       - If pre-built versions, where did you get them?
>>>>>    4. A copy of your slurm.conf file (you may want to change node
>>>>>    names and other potentially sensitive information)
>>>>>
>>>>> Andy
>>>>>
>>>>> On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
>>>>>
>>>>> Hello everyone,
>>>>>
>>>>> I've set up a basic configuration using Slurm with a master node, a
>>>>> backup node, a login node, and eight compute nodes.
>>>>>
>>>>> Everything in Slurm is working fine. I can submit jobs and see the
>>>>> state of the eight nodes as idle. The problem is with Open MPI. The
>>>>> parallel hello program, where each process prints its rank within the
>>>>> global set, works, but when I try to establish communication between
>>>>> nodes through MPI_Send and MPI_Recv, it just hangs indefinitely.
>>>>>
>>>>> I'm using CentOS 7; firewalld and SELinux are disabled. If I launch my
>>>>> parallel program, ptest, on 2 nodes [n1, n2], a quick check with
>>>>> "lsof -i" shows that ptest is listening on port 1024 on both nodes,
>>>>> which I find weird since only one should be listening. Moreover, I've
>>>>> set Slurm's MPI parameters to pmi2 with allowed ports [12000-12999],
>>>>> so why is it still using port 1024?
>>>>>
>>>>> I hope you can help me with this problem. I can't see what's wrong.
>>>>> Thank you in advance.
>>>>>
>>>>> M. Acheli.
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
