Okay, I'll do so. Thank you very much for all the help. And sorry for the missing information; I didn't know it could cause any problems, and I completely forgot to mention it.
Concerning SLURM / OMPI integration, I'll try reinstalling my configuration and keep you updated.

2016-04-30 21:01 GMT+01:00 Ralph Castain <[email protected]>:

> If you can identify the name of the adaptor (e.g., “eth0”), then you can
> either:
>
> * include the one you want to use: -mca oob_tcp_if_include <foo> -mca
> btl_tcp_if_include <foo>
>
> * exclude the Internet adaptor: -mca oob_tcp_if_exclude <bar> -mca
> btl_tcp_if_exclude <bar>
>
> You cannot do both at the same time.
>
> FWIW: it would help us to help you if you tell us up front that you are
> working with virtual machines as there are special issues when doing so :-/
>
> On Apr 30, 2016, at 12:51 PM, Mehdi Acheli <[email protected]> wrote:
>
> No, the original program didn't include a bug. It's failing for the
> same reason as the second. Since there is only one process in the world,
> when the original program tries to address another process with rank 1, it
> throws an error. On the other hand, yes, it seems I have a problem with my
> SLURM/OMPI integration. For the moment, I guess I'll just have to work with
> "salloc -> mpirun".
> Thankfully, I was able to locate the problem through the "--mca
> plm_base_verbose 10" option. I am running my cluster on virtual machines,
> each one having two network adapters: one for local access and the
> other connected to the Internet. I don't know why, but OMPI tries to use the
> Internet network adapter and thus fails to establish communication. I had to
> remove the said adapter. Is there a way to configure OMPI to avoid the
> problem?
>
> Thank you again for your interventions.
>
> 2016-04-30 20:34 GMT+01:00 Ralph Castain <[email protected]>:
>
>> As I said, your original program has a bug in it - you are using “rank”
>> values that are invalid. This is why it is failing when run under mpirun.
>>
>> This second problem is caused by your SLURM integration to OMPI being
>> broken, probably due to not correctly linking the PMI support
>>
>> On Apr 30, 2016, at 11:56 AM, Mehdi Acheli <[email protected]> wrote:
>>
>> Yes, if I use "salloc -N2 sh" and then launch the job via mpirun, the
>> hello world program works well. However, my original program is still
>> blocking on the send and receive lines.
>>
>> 2016-04-30 19:47 GMT+01:00 Ralph Castain <[email protected]>:
>>
>>> Your slurm-OMPI integration is clearly broken - the processes do not
>>> realize they are operating in a common world. Does it work if you use
>>> mpirun instead of srun?
>>>
>>> On Apr 30, 2016, at 11:28 AM, Mehdi Acheli <[email protected]> wrote:
>>>
>>> No, I just tested another program and it seems that the world_size is
>>> reduced to one even though I launch the job on two nodes. The hello program
>>> is doing the same. Well, I am completely lost now.
>>> <Capture.PNG>
>>>
>>> <Capture.PNG><Capture1.PNG>
>>>
>>> 2016-04-30 19:09 GMT+01:00 Ralph Castain <[email protected]>:
>>>
>>>> This looks like a bug in your program - you specified an invalid rank
>>>> when attempting to send.
>>>>
>>>> On Apr 30, 2016, at 10:59 AM, Mehdi Acheli <[email protected]> wrote:
>>>>
>>>> I just did. Permit me to include a capture of the script output file:
>>>>
>>>> <Capture.PNG>
>>>>
>>>> I specify the option "-N 2" in my script, but it looks like the
>>>> world_size is composed of only one process and both nodes are trying to
>>>> execute an MPI_Send!
>>>>
>>>> 2016-04-30 18:41 GMT+01:00 Andy Riebs <[email protected]>:
>>>>
>>>>> Aha! I missed it the first time... In your script, replace "mpirun"
>>>>> with "srun" and the world should be better.
>>>>>
>>>>> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>>>>>
>>>>> Euh, I did a "make all install" so I think pmi support is installed.
>>>>> And the hello world program is working; would it work if pmi support
>>>>> wasn't installed?
>>>>>
>>>>> 2016-04-30 18:04 GMT+01:00 Andy Riebs <[email protected]>:
>>>>>
>>>>>> For Slurm, after the "make install", did you do a "make
>>>>>> install-contrib" (which builds the pmi2 support)? I think you would have
>>>>>> seen a runtime error if you hadn't, but possibly not.
>>>>>>
>>>>>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>>>>>
>>>>>> First of all, thank you for the reaction.
>>>>>>
>>>>>> Here are the answers:
>>>>>>
>>>>>> 1. I tried multiple commands:
>>>>>>    1. I started with "srun -N2 --mpi=pmi2 ptest", then I changed
>>>>>>       the slurm.conf's mpi parameter to pmi2 so I no longer need the
>>>>>>       option.
>>>>>>    2. I also tried a script submitted via sbatch. It doesn't work
>>>>>>       either, and squeue shows that it's running. My program is just
>>>>>>       passing a number from node 1 to node 2, so it doesn't normally
>>>>>>       take that long.
>>>>>> 2. OpenMPI version is 1.10.2 / SLURM's is 15.08.8
>>>>>> 3. I built Slurm myself with no specific options. For OpenMPI I
>>>>>>    actually downloaded it from the CentOS 7 default repo. But I tried
>>>>>>    building the same version before with the --with-slurm and
>>>>>>    --with-pmi options, yet it wasn't working either.
>>>>>>
>>>>>> I am attaching a copy of my slurm.conf file and the script I used to
>>>>>> submit the job.
>>>>>>
>>>>>> The script:
>>>>>>
>>>>>>> #!/bin/bash
>>>>>>> #
>>>>>>> #SBATCH --job-name=test
>>>>>>> #SBATCH --output=res_mpi.txt
>>>>>>> #
>>>>>>> #SBATCH -N 2
>>>>>>> module load openmpi
>>>>>>> mpirun test
>>>>>>
>>>>>> Slurm.conf file:
>>>>>>
>>>>>>> # slurm.conf file generated by configurator easy.html.
>>>>>>> # Put this file on all nodes of your cluster.
>>>>>>> # See the slurm.conf man page for more information.
>>>>>>> #
>>>>>>> ControlMachine=m
>>>>>>> ControlAddr=m
>>>>>>> BackupController=mb
>>>>>>> BackupAddr=mb
>>>>>>> #
>>>>>>> #MailProg=/bin/mail
>>>>>>> MpiDefault=pmi2
>>>>>>> MpiParams=ports=12000-12999
>>>>>>> ProctrackType=proctrack/linuxproc
>>>>>>> ReturnToService=2
>>>>>>> #SlurmctldPidFile=/var/run/slurmctld.pid
>>>>>>> #SlurmctldPort=6817
>>>>>>> #SlurmdPidFile=/var/run/slurmd.pid
>>>>>>> #SlurmdPort=6818
>>>>>>> SlurmdSpoolDir=/var/spool/slurm/slurmd
>>>>>>> SlurmUser=slurm
>>>>>>> #SlurmdUser=root
>>>>>>> #StateSaveLocation=/var/spool/slurm
>>>>>>> StateSaveLocation=/mnt/data/spool/slurm
>>>>>>> SwitchType=switch/none
>>>>>>> TaskPlugin=task/none
>>>>>>> #
>>>>>>> #
>>>>>>> # TIMERS
>>>>>>> #KillWait=30
>>>>>>> #MinJobAge=300
>>>>>>> #SlurmctldTimeout=120
>>>>>>> #SlurmdTimeout=300
>>>>>>> #
>>>>>>> #
>>>>>>> # SCHEDULING
>>>>>>> FastSchedule=1
>>>>>>> SchedulerType=sched/backfill
>>>>>>> #SchedulerPort=7321
>>>>>>> SelectType=select/linear
>>>>>>> PreemptType=preempt/partition_prio
>>>>>>> PreemptMode=requeue
>>>>>>> #
>>>>>>> #
>>>>>>> # LOGGING AND ACCOUNTING
>>>>>>> AccountingStorageType=accounting_storage/slurmdbd
>>>>>>> #JobAcctGatherFrequency=30
>>>>>>> JobAcctGatherType=jobacct_gather/linux
>>>>>>> JobCompType=jobcomp/none
>>>>>>> #SlurmctldDebug=3
>>>>>>> #SlurmctldLogFile=/var/log/slurmctld.log
>>>>>>> SlurmctldLogFile=/mnt/data/log/slurmctld.log
>>>>>>> #SlurmdDebug=3
>>>>>>> SlurmdLogFile=/var/log/slurmd.log
>>>>>>> AccountingStorageBackupHost=mb
>>>>>>> #
>>>>>>> #
>>>>>>> # COMPUTE NODES
>>>>>>> NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
>>>>>>> NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
>>>>>>
>>>>>> 2016-04-30 16:40 GMT+01:00 Andy Riebs <[email protected]>:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> The one problem that I see in your description is minor, and
>>>>>>> probably not significant: the MPI ports parameter was needed for very
>>>>>>> old versions of Open MPI, IIRC.
>>>>>>>
>>>>>>> To help debug your problems, please respond to this list with
>>>>>>>
>>>>>>> 1. What command did you use to invoke your program?
>>>>>>> 2. What versions of Slurm and OpenMPI are you using?
>>>>>>> 3. Did you build them yourself, or use prebuilt versions?
>>>>>>>    - If you built them yourself, what configuration options did you
>>>>>>>      use?
>>>>>>>    - If pre-built versions, where did you get them?
>>>>>>> 4. A copy of your slurm.conf file (you may want to change node
>>>>>>>    names and other potentially sensitive information)
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
>>>>>>>
>>>>>>> Hello everyone,
>>>>>>>
>>>>>>> I've set up a basic configuration using slurm with a master node,
>>>>>>> a backup node, a login node and eight compute nodes.
>>>>>>>
>>>>>>> Everything in slurm is working fine. I can issue jobs and see the
>>>>>>> state of the eight nodes as Idle. The problem is with OpenMPI. The
>>>>>>> hello parallel program, where each process prints its rank among the
>>>>>>> global set, is working, but when I try to establish communications
>>>>>>> between nodes through MPI_Send and MPI_Recv, it just hangs there
>>>>>>> indefinitely.
>>>>>>>
>>>>>>> I'm using CentOS 7; firewalld and SELinux are disabled. If I launch
>>>>>>> my parallel program, ptest, on 2 nodes, [n1, n2], a quick check with
>>>>>>> lsof -i shows that ptest is listening on port 1024 on both nodes,
>>>>>>> which I find weird since only one should be listening. Moreover, I've
>>>>>>> set the slurm MPI parameters to pmi2 and the allowed ports to
>>>>>>> [12000-12999], so why is it still using port 1024?
>>>>>>>
>>>>>>> I hope you can help me with this problem. I can't see what's wrong.
>>>>>>> Thank you in advance.
>>>>>>>
>>>>>>> M. Acheli.
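
For reference, the interface selection suggested in the thread could be assembled roughly as below for the "salloc -> mpirun" workflow. This is only a sketch: the adapter name "eth0" is an assumption taken from the example in the reply (the real name of the private-network adapter has to be checked on each node), and the script prints the resulting command instead of running it so it can be inspected without an Open MPI installation.

```shell
#!/bin/bash
# Assumption: "eth0" is the adapter on the private cluster network.
# This is a hypothetical name; check the real one on each node.
LOCAL_IF="eth0"

# Restrict both the out-of-band (oob) layer and the TCP transport (btl)
# to that adapter. Use either the *_if_include pair or the *_if_exclude
# pair, never both at the same time.
MCA_ARGS="-mca oob_tcp_if_include ${LOCAL_IF} -mca btl_tcp_if_include ${LOCAL_IF}"

# Build the mpirun command line for the test program "ptest" from the
# thread and print it rather than executing it.
CMD="mpirun ${MCA_ARGS} ptest"
echo "${CMD}"
```

Inside an allocation obtained with "salloc -N2 sh", the printed command can then be run as-is. Excluding the Internet-facing adapter with the two *_if_exclude parameters is the alternative when its name is the one that is known.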
