If you can identify the name of the adaptor (e.g., “eth0”), then you can either:

* include the one you want to use: -mca oob_tcp_if_include <foo> -mca btl_tcp_if_include <foo>

* exclude the Internet adaptor: -mca oob_tcp_if_exclude <bar> -mca btl_tcp_if_exclude <bar>

You cannot do both at the same time: the include and exclude forms of a parameter are mutually exclusive.
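
As a rough sketch of the two options above (not an official Open MPI tool): the hypothetical helper mca_if_args below only builds the argument string, applying the same include or exclude list to both oob_tcp and btl_tcp. The interface name eth0 and the program ./ptest in the comment are placeholders; substitute whatever your nodes actually report.

```shell
#!/bin/bash
# Sketch: build the -mca arguments that pin both the OOB and BTL TCP
# components to the same include/exclude interface list.
# mca_if_args is a hypothetical helper, not part of Open MPI.
mca_if_args() {
    local mode="$1" ifaces="$2"
    case "$mode" in
        include|exclude) ;;
        *) echo "usage: mca_if_args include|exclude <iface[,iface...]>" >&2
           return 1 ;;
    esac
    if [ -z "$ifaces" ]; then
        echo "no interface given" >&2
        return 1
    fi
    # Emit one mode only, since include and exclude cannot be combined.
    printf -- '-mca oob_tcp_if_%s %s -mca btl_tcp_if_%s %s\n' \
        "$mode" "$ifaces" "$mode" "$ifaces"
}

# Example (eth0 and ./ptest are placeholders):
#   mpirun $(mca_if_args include eth0) ./ptest
```

The same parameters can also be set through the environment as OMPI_MCA_<name> (e.g., OMPI_MCA_btl_tcp_if_include=eth0) if you would rather not pass them on the command line. If you go the exclude route, you probably want to keep the loopback interface in the list as well, since your value replaces the default exclude list.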

FWIW: it would help us to help you if you tell us up front that you are working 
with virtual machines as there are special issues when doing so :-/


> On Apr 30, 2016, at 12:51 PM, Mehdi Acheli <[email protected]> wrote:
> 
> No, the original program didn't include a bug. It's failing due to the same 
> reason as the second. Since there is only one process in the world, when the 
> original program tries to mention another process with rank 1, it throws an 
> error. On the other hand, yes. It seems I have a problem on my SLURM/OMPI 
> integration. For the moment, I guess I'll just have to work with "salloc -> 
> mpirun" 
> Thankfully, I was able to locate the problem through "--mca plm_base_verbose 
> 10" option. I am running my cluster on virtual machines, each one having two 
> network adapters. One for the local access and the other connected to 
> Internet. I don't know why but OMPI tries to use the Internet network adapter 
> thus failing to establish communication. I had to remove the said adapter. Is 
> there a way to configure OMPI to avoid the problem ?
> 
> Thank you again for your interventions.
> 
> 
> 
> 2016-04-30 20:34 GMT+01:00 Ralph Castain <[email protected] 
> <mailto:[email protected]>>:
> As I said, your original program has a bug in it - you are using “rank” 
> values that are invalid. This is why it is failing when run under mpirun.
> 
> This second problem is caused by your SLURM integration to OMPI being broken, 
> probably due to not correctly linking the PMI support
> 
> 
>> On Apr 30, 2016, at 11:56 AM, Mehdi Acheli <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> Yes, if I use "salloc -N2 sh" and then launch the job via mpirun, the hello 
>> world program is doing well. However my original program is still blocking 
>> on the send and receive lines.
>> 
>> 2016-04-30 19:47 GMT+01:00 Ralph Castain <[email protected] 
>> <mailto:[email protected]>>:
>> Your slurm-OMPI integration is clearly broken - the processes do not realize 
>> they are operating in a common world. Does it work if you use mpirun instead 
>> of srun?
>> 
>> 
>>> On Apr 30, 2016, at 11:28 AM, Mehdi Acheli <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> No, I just tested another program and it seems that the world_size is 
>>> reduced to one even though i launch the job on two nodes. The hello program 
>>> is doing the same. Well, I am completely lost now.
>>> <Capture.PNG>
>>> 
>>> <Capture.PNG><Capture1.PNG>
>>> 
>>> 2016-04-30 19:09 GMT+01:00 Ralph Castain <[email protected] 
>>> <mailto:[email protected]>>:
>>> This looks like a bug in your program - you specified an invalid rank when 
>>> attempting to send.
>>> 
>>>> On Apr 30, 2016, at 10:59 AM, Mehdi Acheli <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> I just did. Permit me to include a capture of the script output file: 
>>>> 
>>>> <Capture.PNG>
>>>> 
>>>> I specify in my script the option "-N 2", but it looks like the world_size 
>>>> is composed of only one process and both nodes are trying to execute an 
>>>> MPI_Send !
>>>> 
>>>> 2016-04-30 18:41 GMT+01:00 Andy Riebs <[email protected] 
>>>> <mailto:[email protected]>>:
>>>> Aha! I missed it the first time... In your script, replace "mpirun" with 
>>>> "srun" and the world should be better.
>>>> 
>>>> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>>>>> Euh, I did a "make all install" so I think pmi support is installed. And 
>>>>> the hello world program is working, would it if it wasn't installed ?
>>>>> 
>>>>> 2016-04-30 18:04 GMT+01:00 Andy Riebs <[email protected] 
>>>>> <mailto:[email protected]>>:
>>>>> For Slurm, after the "make install", did you do a "make install-contrib" 
>>>>> (which builds the pmi2 support)? I think you would have seen a runtime 
>>>>> error if you hadn't, but possibly not.
>>>>> 
>>>>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>>>>> First of all, thank you for the reaction.
>>>>>> 
>>>>>> Here are the answers :
>>>>>> I tried multiple commands:
>>>>>> I started with "srun -N2 --mpi=pmi2 ptest" then I changed the 
>>>>>> slurm.conf's mpi parameter to pmi2 so I no longer need the option.
>>>>>> I also tried a script submitted via sbatch. It doesn't work either and 
>>>>>> squeue shows that it's running. My program is just passing a number from 
>>>>>> node 1 to node 2 so it doesn't normally take that long.
>>>>>> OpenMPI version is 1.10.2 / SLURM's is 15.08.8
>>>>>> I built Slurm myself with no specific options. For OpenMPI I actually 
>>>>>> downloaded it from the CentOS 7 default repo. But I tried building the 
>>>>>> same version before with --with-slurm and --with-pmi options, yet it 
>>>>>> wasn't working either.
>>>>>> I am joining a copy of my slurm.conf file and the script I used to 
>>>>>> submit the job.
>>>>>> 
>>>>>> The script:
>>>>>> 
>>>>>> #!/bin/bash
>>>>>> #
>>>>>> #SBATCH --job-name=test
>>>>>> #SBATCH --output=res_mpi.txt
>>>>>> #
>>>>>> #SBATCH -N 2
>>>>>> module load openmpi
>>>>>> mpirun test
>>>>>> 
>>>>>> Slurm.conf file :
>>>>>> 
>>>>>> 
>>>>>> # slurm.conf file generated by configurator easy.html.
>>>>>> # Put this file on all nodes of your cluster.
>>>>>> # See the slurm.conf man page for more information.
>>>>>> #
>>>>>> ControlMachine=m
>>>>>> ControlAddr=m
>>>>>> BackupController=mb
>>>>>> BackupAddr=mb
>>>>>> #
>>>>>> #MailProg=/bin/mail
>>>>>> MpiDefault=pmi2
>>>>>> MpiParams=ports=12000-12999
>>>>>> ProctrackType=proctrack/linuxproc
>>>>>> ReturnToService=2
>>>>>> #SlurmctldPidFile=/var/run/slurmctld.pid
>>>>>> #SlurmctldPort=6817
>>>>>> #SlurmdPidFile=/var/run/slurmd.pid
>>>>>> #SlurmdPort=6818
>>>>>> SlurmdSpoolDir=/var/spool/slurm/slurmd
>>>>>> SlurmUser=slurm
>>>>>> #SlurmdUser=root
>>>>>> #StateSaveLocation=/var/spool/slurm
>>>>>> StateSaveLocation=/mnt/data/spool/slurm
>>>>>> SwitchType=switch/none
>>>>>> TaskPlugin=task/none
>>>>>> #
>>>>>> #
>>>>>> # TIMERS
>>>>>> #KillWait=30
>>>>>> #MinJobAge=300
>>>>>> #SlurmctldTimeout=120
>>>>>> #SlurmdTimeout=300
>>>>>> #
>>>>>> #
>>>>>> # SCHEDULING
>>>>>> FastSchedule=1
>>>>>> SchedulerType=sched/backfill
>>>>>> #SchedulerPort=7321
>>>>>> SelectType=select/linear
>>>>>> PreemptType=preempt/partition_prio
>>>>>> PreemptMode=requeue
>>>>>> #
>>>>>> #
>>>>>> # LOGGING AND ACCOUNTING
>>>>>> AccountingStorageType=accounting_storage/slurmdbd
>>>>>> #JobAcctGatherFrequency=30
>>>>>> JobAcctGatherType=jobacct_gather/linux
>>>>>> JobCompType=jobcomp/none
>>>>>> #SlurmctldDebug=3
>>>>>> #SlurmctldLogFile=/var/log/slurmctld.log
>>>>>> SlurmctldLogFile=/mnt/data/log/slurmctld.log
>>>>>> #SlurmdDebug=3
>>>>>> SlurmdLogFile=/var/log/slurmd.log
>>>>>> AccountingStorageBackupHost=mb
>>>>>> #
>>>>>> #
>>>>>> # COMPUTE NODES
>>>>>> NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
>>>>>> NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
>>>>>> 
>>>>>> 2016-04-30 16:40 GMT+01:00 Andy Riebs <[email protected] 
>>>>>> <mailto:[email protected]>>:
>>>>>> Hi,
>>>>>> 
>>>>>> The one problem that I see in your description is minor, and probably 
>>>>>> not significant: the MPI ports parameter was needed for very old 
>>>>>> versions of Open MPI, IIRC.
>>>>>> 
>>>>>> To help debug your problems, please respond to this list with:
>>>>>> * What command did you use to invoke your program?
>>>>>> * What versions of Slurm and OpenMPI are you using?
>>>>>> * Did you build them yourself, or use prebuilt versions?
>>>>>> * If you built them yourself, what configuration options did you use?
>>>>>> * If pre-built versions, where did you get them?
>>>>>> * A copy of your slurm.conf file (you may want to change node names and 
>>>>>> other potentially sensitive information)
>>>>>> Andy
>>>>>> 
>>>>>> On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
>>>>>>> Hello everyone,
>>>>>>> 
>>>>>>> I've set up a basic configuration using slurm with a 
>>>>>>> master node, a backup node, a login node and eight compute nodes.
>>>>>>> 
>>>>>>> Everything in slurm is working fine. I can issue 
>>>>>>> jobs and see the state of the eight nodes as Idle. The problem is with 
>>>>>>> OpenMPI. The hello parallel program where each process prints its rank 
>>>>>>> among the global set is working, but when I try to establish 
>>>>>>> communications between nodes through MPI_Send and MPI_Recv, it just 
>>>>>>> hangs there indefinitely.
>>>>>>> 
>>>>>>> I'm using CentOS 7; firewalld and SELinux are disabled. If I launch my 
>>>>>>> parallel program, ptest, on 2 nodes: [n1, n2], a little check with 
>>>>>>> lsof -i shows that ptest is listening on port 1024 on both nodes, which 
>>>>>>> I find weird since only one should be listening. Moreover, I've set 
>>>>>>> slurm Mpi parameters on pmi2 and ports allowed on [12000-12999], so why 
>>>>>>> is it still using port 1024?
>>>>>>> 
>>>>>>> I hope you can help me with this problem. I can't see what's 
>>>>>>> wrong.
>>>>>>> Thank you in advance.
>>>>>>> 
>>>>>>> M. Acheli.
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
