Okay, I'll do so.

Thank you very much for all the help. And sorry for the missing
information: I didn't know it could cause any problem, and I completely
forgot to mention it.

Concerning SLURM / OMPI integration, I'll try reinstalling my configuration
and keep you updated.
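
On the interface selection quoted below, here is a minimal sketch of how
the include/exclude options might be wrapped in a small shell helper. The
function name "ompi_if_args", the interface name "eth1" and the
"-np 2 ./ptest" launch line are placeholders for illustration; only the
four MCA parameter names come from Ralph's message. The helper takes a
single mode, which mirrors the rule that include and exclude cannot be
combined.

```shell
#!/bin/bash
# Hypothetical helper (not part of Open MPI): print the MCA options that
# either restrict Open MPI to one interface ("include") or keep it off
# one interface ("exclude").
# Usage: ompi_if_args include eth0   |   ompi_if_args exclude eth1
ompi_if_args() {
    mode="$1"
    iface="$2"
    case "$mode" in
        include|exclude) ;;
        *) echo "mode must be 'include' or 'exclude'" >&2; return 1 ;;
    esac
    if [ -z "$iface" ]; then
        echo "interface name required" >&2
        return 1
    fi
    # The same interface is passed to both the out-of-band (oob) and the
    # byte transfer layer (btl) TCP components, as in the quoted advice.
    printf '%s\n' "-mca oob_tcp_if_${mode} ${iface} -mca btl_tcp_if_${mode} ${iface}"
}

# Print the resulting command line instead of running it.
echo "mpirun $(ompi_if_args include eth0) -np 2 ./ptest"
```

Printing the command first makes it easy to check the options before
actually launching anything inside an salloc session.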

2016-04-30 21:01 GMT+01:00 Ralph Castain <[email protected]>:

> If you can identify the name of the adaptor (e.g., “eth0”), then you can
> either:
>
> * include the one you want to use: -mca oob_tcp_if_include <foo> -mca
> btl_tcp_if_include <foo>
>
> * exclude the Internet adaptor: -mca oob_tcp_if_exclude <bar> -mca
> btl_tcp_if_exclude <bar>
>
> You cannot do both at the same time.
>
> FWIW: it would help us to help you if you tell us up front that you are
> working with virtual machines as there are special issues when doing so :-/
>
>
> On Apr 30, 2016, at 12:51 PM, Mehdi Acheli <[email protected]> wrote:
>
> No, the original program didn't contain a bug. It was failing for the
> same reason as the second one: since there is only one process in the
> world, when the original program tries to address another process with
> rank 1, it throws an error. On the other hand, yes, it seems I have a
> problem with my SLURM/OMPI integration. For the moment, I guess I'll just
> have to work with "salloc -> mpirun".
> Thankfully, I was able to locate the problem through the "--mca
> plm_base_verbose 10" option. I am running my cluster on virtual machines,
> each one having two network adapters: one for local access and the
> other connected to the Internet. I don't know why, but OMPI tries to use
> the Internet-facing adapter and thus fails to establish communication. I
> had to remove that adapter. Is there a way to configure OMPI to avoid the
> problem?
>
> Thank you again for your interventions.
>
>
>
> 2016-04-30 20:34 GMT+01:00 Ralph Castain <[email protected]>:
>
>> As I said, your original program has a bug in it - you are using “rank”
>> values that are invalid. This is why it is failing when run under mpirun.
>>
>> This second problem is caused by your SLURM integration to OMPI being
>> broken, probably due to not correctly linking the PMI support
>>
>>
>> On Apr 30, 2016, at 11:56 AM, Mehdi Acheli <[email protected]> wrote:
>>
>> Yes, if I use "salloc -N2 sh" and then launch the job via mpirun, the
>> hello world program is doing well. However my original program is still
>> blocking on the send and receive lines.
>>
>> 2016-04-30 19:47 GMT+01:00 Ralph Castain <[email protected]>:
>>
>>> Your slurm-OMPI integration is clearly broken - the processes do not
>>> realize they are operating in a common world. Does it work if you use
>>> mpirun instead of srun?
>>>
>>>
>>> On Apr 30, 2016, at 11:28 AM, Mehdi Acheli <[email protected]> wrote:
>>>
>>> No, I just tested another program and it seems that the world_size is
>>> reduced to one even though I launch the job on two nodes. The hello program
>>> is doing the same. Well, I am completely lost now.
>>>
>>> <Capture.PNG><Capture1.PNG>
>>>
>>> 2016-04-30 19:09 GMT+01:00 Ralph Castain <[email protected]>:
>>>
>>>> This looks like a bug in your program - you specified an invalid rank
>>>> when attempting to send.
>>>>
>>>> On Apr 30, 2016, at 10:59 AM, Mehdi Acheli <[email protected]> wrote:
>>>>
>>>> I just did. Permit me to include a capture of the script output file:
>>>>
>>>> <Capture.PNG>
>>>>
>>>> I specify in my script the option "-N 2", but it looks like the
>>>> world_size is composed of only one process and both nodes are trying to
>>>> execute an MPI_Send !
>>>>
>>>> 2016-04-30 18:41 GMT+01:00 Andy Riebs <[email protected]>:
>>>>
>>>>> Aha! I missed it the first time... In your script, replace "mpirun"
>>>>> with "srun" and the world should be better.
>>>>>
>>>>> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>>>>>
>>>>> Er, I did a "make all install", so I think PMI support is installed.
>>>>> And the hello world program is working; would it work if it wasn't installed?
>>>>>
>>>>> 2016-04-30 18:04 GMT+01:00 Andy Riebs <[email protected]>:
>>>>>
>>>>>> For Slurm, after the "make install", did you do a "make
>>>>>> install-contrib" (which builds the pmi2 support)? I think you would have
>>>>>> seen a runtime error if you hadn't, but possibly not.
>>>>>>
>>>>>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>>>>>
>>>>>> First of all, thank you for the reaction.
>>>>>>
>>>>>> Here are the answers :
>>>>>>
>>>>>>    1. I tried multiple commands:
>>>>>>       1. I started with "srun -N2 --mpi=pmi2 ptest" then I changed
>>>>>>       the slurm.conf's mpi parameter to pmi2 so I no longer need the
>>>>>>       option.
>>>>>>       2. I also tried a script submitted via sbatch. It doesn't work
>>>>>>       either and squeue shows that it's running. My program is just
>>>>>>       passing a number from node 1 to node 2 so it doesn't normally
>>>>>>       take that long.
>>>>>>    2. OpenMPI version is 1.10.2 / SLURM's is 15.08.8
>>>>>>    3. I built Slurm myself with no specific options. For OpenMPI I
>>>>>>    actually downloaded it from the CentOS 7 default repo. But I tried
>>>>>>    building the same version before with --with-slurm and --with-pmi
>>>>>>    options, yet it wasn't working either.
>>>>>>
>>>>>> I am attaching a copy of my slurm.conf file and the script I used to
>>>>>> submit the job.
>>>>>>
>>>>>> The script:
>>>>>>
>>>>>>> #!/bin/bash
>>>>>>> #
>>>>>>> #SBATCH --job-name=test
>>>>>>> #SBATCH --output=res_mpi.txt
>>>>>>> #
>>>>>>> #SBATCH -N 2
>>>>>>> module load openmpi
>>>>>>> mpirun test
>>>>>>
>>>>>>
>>>>>> slurm.conf file:
>>>>>>
>>>>>>> # slurm.conf file generated by configurator easy.html.
>>>>>>> # Put this file on all nodes of your cluster.
>>>>>>> # See the slurm.conf man page for more information.
>>>>>>> #
>>>>>>> ControlMachine=m
>>>>>>> ControlAddr=m
>>>>>>> BackupController=mb
>>>>>>> BackupAddr=mb
>>>>>>> #
>>>>>>> #MailProg=/bin/mail
>>>>>>> MpiDefault=pmi2
>>>>>>> MpiParams=ports=12000-12999
>>>>>>> ProctrackType=proctrack/linuxproc
>>>>>>> ReturnToService=2
>>>>>>> #SlurmctldPidFile=/var/run/slurmctld.pid
>>>>>>> #SlurmctldPort=6817
>>>>>>> #SlurmdPidFile=/var/run/slurmd.pid
>>>>>>> #SlurmdPort=6818
>>>>>>> SlurmdSpoolDir=/var/spool/slurm/slurmd
>>>>>>> SlurmUser=slurm
>>>>>>> #SlurmdUser=root
>>>>>>> #StateSaveLocation=/var/spool/slurm
>>>>>>> StateSaveLocation=/mnt/data/spool/slurm
>>>>>>> SwitchType=switch/none
>>>>>>> TaskPlugin=task/none
>>>>>>> #
>>>>>>> #
>>>>>>> # TIMERS
>>>>>>> #KillWait=30
>>>>>>> #MinJobAge=300
>>>>>>> #SlurmctldTimeout=120
>>>>>>> #SlurmdTimeout=300
>>>>>>> #
>>>>>>> #
>>>>>>> # SCHEDULING
>>>>>>> FastSchedule=1
>>>>>>> SchedulerType=sched/backfill
>>>>>>> #SchedulerPort=7321
>>>>>>> SelectType=select/linear
>>>>>>> PreemptType=preempt/partition_prio
>>>>>>> PreemptMode=requeue
>>>>>>> #
>>>>>>> #
>>>>>>> # LOGGING AND ACCOUNTING
>>>>>>> AccountingStorageType=accounting_storage/slurmdbd
>>>>>>> #JobAcctGatherFrequency=30
>>>>>>> JobAcctGatherType=jobacct_gather/linux
>>>>>>> JobCompType=jobcomp/none
>>>>>>> #SlurmctldDebug=3
>>>>>>> #SlurmctldLogFile=/var/log/slurmctld.log
>>>>>>> SlurmctldLogFile=/mnt/data/log/slurmctld.log
>>>>>>> #SlurmdDebug=3
>>>>>>> SlurmdLogFile=/var/log/slurmd.log
>>>>>>> AccountingStorageBackupHost=mb
>>>>>>> #
>>>>>>> #
>>>>>>> # COMPUTE NODES
>>>>>>> NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
>>>>>>> NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
>>>>>>
>>>>>> 2016-04-30 16:40 GMT+01:00 Andy Riebs <[email protected]>:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> The one problem that I see in your description is minor, and
>>>>>>> probably not significant: the MPI ports parameter was needed for very
>>>>>>> old versions of Open MPI, IIRC.
>>>>>>>
>>>>>>> To help debug your problems, please respond to this list with
>>>>>>>
>>>>>>>    1. What command did you use to invoke your program?
>>>>>>>    2. What versions of Slurm and OpenMPI are you using?
>>>>>>>    3. Did you build them yourself, or use prebuilt versions?
>>>>>>>       - If you built them yourself, what configuration options did you
>>>>>>>       use?
>>>>>>>       - If pre-built versions, where did you get them?
>>>>>>>    4. A copy of your slurm.conf file (you may want to change node
>>>>>>>    names and other potentially sensitive information)
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
>>>>>>>
>>>>>>> Hello everyone,
>>>>>>>
>>>>>>> I've set up a basic configuration using slurm with a master node,
>>>>>>> a backup node, a login node and eight compute nodes.
>>>>>>>
>>>>>>> Everything in slurm is working fine. I can issue jobs and see the
>>>>>>> state of the eight nodes as Idle. The problem is with OpenMPI. The
>>>>>>> hello parallel program, where each process prints its rank among the
>>>>>>> global set, is working, but when I try to establish communications
>>>>>>> between nodes through MPI_Send and MPI_Recv, it just hangs there
>>>>>>> indefinitely.
>>>>>>>
>>>>>>> I'm using CentOS 7; firewalld and SELinux are disabled. If I launch
>>>>>>> my parallel program, ptest, on 2 nodes, [n1, n2], a quick check with
>>>>>>> lsof -i shows that ptest is listening on port 1024 on both nodes,
>>>>>>> which I find weird since only one should be listening. Moreover, I've
>>>>>>> set the slurm MPI parameters to pmi2 and the allowed ports to
>>>>>>> [12000-12999], so why is it still using port 1024?
>>>>>>>
>>>>>>> I hope you can help me with this problem. I can't see what's wrong.
>>>>>>> Thank you in advance.
>>>>>>>
>>>>>>> M. Acheli.
>>>>>>>