If you can identify the name of the adaptor (e.g., “eth0”), then you can either:
* include the one you want to use: -mca oob_tcp_if_include <foo> -mca btl_tcp_if_include <foo>
* exclude the Internet adaptor: -mca oob_tcp_if_exclude <bar> -mca btl_tcp_if_exclude <bar>

You cannot do both at the same time.

FWIW: it would help us to help you if you tell us up front that you are working with virtual machines as there are special issues when doing so :-/

> On Apr 30, 2016, at 12:51 PM, Mehdi Acheli wrote:
>
> No, the original program didn't include a bug. It's failing due to the same
> reason as the second. Since there is only one process in the world, when the
> original program tries to mention another process with rank 1, it throws an
> error. On the other hand, yes. It seems I have a problem on my SLURM/OMPI
> integration. For the moment, I guess I'll just have to work with
> "salloc -> mpirun"
> Thankfully, I was able to locate the problem through "--mca plm_base_verbose
> 10" option. I am running my cluster on virtual machines, each one having two
> network adapters. One for the local access and the other connected to
> Internet. I don't know why but OMPI tries to use the Internet network adapter
> thus failing to establish communication. I had to remove the said adapter. Is
> there a way to configure OMPI to avoid the problem ?
>
> Thank you again for your interventions.
>
> 2016-04-30 20:34 GMT+01:00 Ralph Castain:
> As I said, your original program has a bug in it - you are using “rank”
> values that are invalid. This is why it is failing when run under mpirun.
>
> This second problem is caused by your SLURM integration to OMPI being broken,
> probably due to not correctly linking the PMI support
>
>> On Apr 30, 2016, at 11:56 AM, Mehdi Acheli wrote:
>>
>> Yes, if I use "salloc -N2 sh" and then launch the job via mpirun, the hello
>> world program is doing well.
>> However my original program is still blocking on the send and receive lines.
>>
>> 2016-04-30 19:47 GMT+01:00 Ralph Castain:
>> Your slurm-OMPI integration is clearly broken - the processes do not realize
>> they are operating in a common world. Does it work if you use mpirun instead
>> of srun?
>>
>>> On Apr 30, 2016, at 11:28 AM, Mehdi Acheli wrote:
>>>
>>> No, I just tested another program and it seems that the world_size is
>>> reduced to one even though i launch the job on two nodes. The hello program
>>> is doing the same. Well, I am completely lost now.
>>> <Capture.PNG>
>>>
>>> <Capture.PNG><Capture1.PNG>
>>>
>>> 2016-04-30 19:09 GMT+01:00 Ralph Castain:
>>> This looks like a bug in your program - you specified an invalid rank when
>>> attempting to send.
>>>
>>>> On Apr 30, 2016, at 10:59 AM, Mehdi Acheli wrote:
>>>>
>>>> I just did. Permit me to include a capture of the script output file:
>>>>
>>>> <Capture.PNG>
>>>>
>>>> I specify in my script the option "-N 2", but it looks like the world_size
>>>> is composed of only one process and both nodes are trying to execute an
>>>> MPI_Send !
>>>>
>>>> 2016-04-30 18:41 GMT+01:00 Andy Riebs:
>>>> Aha! I missed it the first time... In your script, replace "mpirun" with
>>>> "srun" and the world should be better.
>>>>
>>>> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>>>>> Euh, I did a "make all install" so I think pmi support is installed. And
>>>>> the hello world program is working, would it if it wasn't installed ?
>>>>>
>>>>> 2016-04-30 18:04 GMT+01:00 Andy Riebs:
>>>>> For Slurm, after the "make install", did you do a "make install-contrib"
>>>>> (which builds the pmi2 support)?
>>>>> I think you would have seen a runtime error if you hadn't, but possibly not.
>>>>>
>>>>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>>>>> First of all, thank you for the reaction.
>>>>>>
>>>>>> Here are the answers :
>>>>>> I tried multiple commands:
>>>>>> I started with "srun -N2 --mpi=pmi2 ptest" then I changed the
>>>>>> slurm.conf's mpi parameter to pmi2 so I no longer need the option.
>>>>>> I also tried a script submitted via sbatch. It doesn't work either and
>>>>>> squeue shows that it's running. My program is just passing a number from
>>>>>> node 1 to node 2 so it doesn't normally take that long.
>>>>>> OpenMPI version is 1.10.2 / SLURM's is 15.08.8
>>>>>> I built Slurm myself with no specific options. For OpenMPI I actually
>>>>>> downloaded it from the CentOS 7 default repo. But I tried building the
>>>>>> same version before with --with-slurm and --with-pmi options, yet it
>>>>>> wasn't working either.
>>>>>> I am joining a copy of my slurm.conf file and the script I used to
>>>>>> submit the job.
>>>>>>
>>>>>> The script :
>>>>>>
>>>>>> #!/bin/bash
>>>>>> #
>>>>>> #SBATCH --job-name=test
>>>>>> #SBATCH --output=res_mpi.txt
>>>>>> #
>>>>>> #SBATCH -N 2
>>>>>> module load openmpi
>>>>>> mpirun test
>>>>>>
>>>>>> Slurm.conf file :
>>>>>>
>>>>>> # slurm.conf file generated by configurator easy.html.
>>>>>> # Put this file on all nodes of your cluster.
>>>>>> # See the slurm.conf man page for more information.
>>>>>> #
>>>>>> ControlMachine=m
>>>>>> ControlAddr=m
>>>>>> BackupController=mb
>>>>>> BackupAddr=mb
>>>>>> #
>>>>>> #MailProg=/bin/mail
>>>>>> MpiDefault=pmi2
>>>>>> MpiParams=ports=12000-12999
>>>>>> ProctrackType=proctrack/linuxproc
>>>>>> ReturnToService=2
>>>>>> #SlurmctldPidFile=/var/run/slurmctld.pid
>>>>>> #SlurmctldPort=6817
>>>>>> #SlurmdPidFile=/var/run/slurmd.pid
>>>>>> #SlurmdPort=6818
>>>>>> SlurmdSpoolDir=/var/spool/slurm/slurmd
>>>>>> SlurmUser=slurm
>>>>>> #SlurmdUser=root
>>>>>> #StateSaveLocation=/var/spool/slurm
>>>>>> StateSaveLocation=/mnt/data/spool/slurm
>>>>>> SwitchType=switch/none
>>>>>> TaskPlugin=task/none
>>>>>> #
>>>>>> #
>>>>>> # TIMERS
>>>>>> #KillWait=30
>>>>>> #MinJobAge=300
>>>>>> #SlurmctldTimeout=120
>>>>>> #SlurmdTimeout=300
>>>>>> #
>>>>>> #
>>>>>> # SCHEDULING
>>>>>> FastSchedule=1
>>>>>> SchedulerType=sched/backfill
>>>>>> #SchedulerPort=7321
>>>>>> SelectType=select/linear
>>>>>> PreemptType=preempt/partition_prio
>>>>>> PreemptMode=requeue
>>>>>> #
>>>>>> #
>>>>>> # LOGGING AND ACCOUNTING
>>>>>> AccountingStorageType=accounting_storage/slurmdbd
>>>>>> #JobAcctGatherFrequency=30
>>>>>> JobAcctGatherType=jobacct_gather/linux
>>>>>> JobCompType=jobcomp/none
>>>>>> #SlurmctldDebug=3
>>>>>> #SlurmctldLogFile=/var/log/slurmctld.log
>>>>>> SlurmctldLogFile=/mnt/data/log/slurmctld.log
>>>>>> #SlurmdDebug=3
>>>>>> SlurmdLogFile=/var/log/slurmd.log
>>>>>> AccountingStorageBackupHost=mb
>>>>>> #
>>>>>> #
>>>>>> # COMPUTE NODES
>>>>>> NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
>>>>>> NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
>>>>>>
>>>>>> 2016-04-30 16:40 GMT+01:00 Andy Riebs:
>>>>>> Hi,
>>>>>>
>>>>>> The one problem that I see in your description is minor, and probably
>>>>>> not significant: the MPI ports parameter was needed for very old
>>>>>> versions of Open MPI, IIRC.
>>>>>>
>>>>>> To help debug your problems, please respond to this list with
>>>>>> What command did you use to invoke your program?
>>>>>> What versions of Slurm and OpenMPI are you using?
>>>>>> Did you build them yourself, or use prebuilt versions?
>>>>>> If you built them yourself, what configuration options did you use?
>>>>>> If pre-built versions, where did you get them?
>>>>>> A copy of your slurm.conf file (you may want to change node names and
>>>>>> other potentially sensitive information)
>>>>>> Andy
>>>>>>
>>>>>> On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
>>>>>>> Hello everyone,
>>>>>>>
>>>>>>> I've set a basic configuration using slurm with a
>>>>>>> master node, backup node, a login node and eight compute node.
>>>>>>>
>>>>>>> Everything in slurm is working fine. I can issue
>>>>>>> jobs and see the state of the eight nodes as Idle. The problem is with
>>>>>>> OpenMPI. The hello parallel program where each process prints its rank
>>>>>>> among the global set is working but when i try to establish
>>>>>>> communications between nodes through MPI_Send and MPI_Recv, it just
>>>>>>> hangs there undefinitely.
>>>>>>>
>>>>>>> I'm using CentOS 7, firewalld and SElinux are disabled. If i launch my
>>>>>>> parallel program, ptest, on 2 nodes : [n1, n2], a little check with
>>>>>>> lsof -i shows that ptest is listening on port 1024 on both nodes, which
>>>>>>> i find weird since only one should be listening. Moreover, i've set
>>>>>>> slurm Mpi parameters on pmi2 and ports allowed on [12000-12999], so why
>>>>>>> is it still using port 1024 ?
>>>>>>>
>>>>>>> I hope u can help me with this problem. I can't see what's
>>>>>>> wrong.
>>>>>>> Thank you in advance.
>>>>>>>
>>>>>>> M. Acheli.
