As I said, your original program has a bug in it - you are using “rank” values that are invalid. This is why it is failing when run under mpirun.
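To illustrate the kind of check that catches this, here is a minimal sketch in C. I have not seen your source, so it assumes the program sends to a hard-coded peer rank. `rank_in_range` and `report_bad_rank` are hypothetical helper names, not part of MPI, and the MPI calls appear only in a comment so the snippet compiles on its own.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of MPI): returns 1 when `peer` is a rank
 * that exists in a communicator of `world_size` processes, 0 otherwise.
 * Ranks run from 0 to world_size - 1, so a world of size 1 only has rank 0.
 */
int rank_in_range(int peer, int world_size)
{
    return peer >= 0 && peer < world_size;
}

/* Hypothetical helper: prints a diagnostic before the program gives up. */
void report_bad_rank(int my_rank, int peer, int world_size)
{
    fprintf(stderr, "rank %d: peer %d does not exist in a world of size %d\n",
            my_rank, peer, world_size);
}

/*
 * Intended use inside an MPI program (assumed two-process exchange):
 *
 *     int rank, size;
 *     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 *     MPI_Comm_size(MPI_COMM_WORLD, &size);
 *     int peer = (rank == 0) ? 1 : 0;
 *     if (!rank_in_range(peer, size)) {
 *         report_bad_rank(rank, peer, size);
 *         MPI_Abort(MPI_COMM_WORLD, 1);
 *     }
 *     // only now call MPI_Send / MPI_Recv with `peer`
 */
```

With a guard like this, a job whose processes each see a world of size 1 would stop with a clear message instead of failing inside the send.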
This second problem is caused by your SLURM integration with OMPI being broken, probably because the PMI support was not linked correctly.

> On Apr 30, 2016, at 11:56 AM, Mehdi Acheli <[email protected]> wrote:
>
> Yes, if I use "salloc -N2 sh" and then launch the job via mpirun, the hello
> world program is doing well. However, my original program is still blocking on
> the send and receive lines.
>
> 2016-04-30 19:47 GMT+01:00 Ralph Castain <[email protected]>:
> Your slurm-OMPI integration is clearly broken - the processes do not realize
> they are operating in a common world. Does it work if you use mpirun instead
> of srun?
>
>
>> On Apr 30, 2016, at 11:28 AM, Mehdi Acheli <[email protected]> wrote:
>>
>> No, I just tested another program and it seems that the world_size is
>> reduced to one even though I launch the job on two nodes. The hello program
>> is doing the same. Well, I am completely lost now.
>> <Capture.PNG>
>>
>> <Capture.PNG><Capture1.PNG>
>>
>> 2016-04-30 19:09 GMT+01:00 Ralph Castain <[email protected]>:
>> This looks like a bug in your program - you specified an invalid rank when
>> attempting to send.
>>
>>> On Apr 30, 2016, at 10:59 AM, Mehdi Acheli <[email protected]> wrote:
>>>
>>> I just did. Permit me to include a capture of the script output file:
>>>
>>> <Capture.PNG>
>>>
>>> I specify in my script the option "-N 2", but it looks like the world_size
>>> is composed of only one process and both nodes are trying to execute an
>>> MPI_Send!
>>>
>>> 2016-04-30 18:41 GMT+01:00 Andy Riebs <[email protected]>:
>>> Aha! I missed it the first time... In your script, replace "mpirun" with
>>> "srun" and the world should be better.
>>>
>>> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>>>> Uh, I did a "make all install" so I think pmi support is installed. And
>>>> the hello world program is working; would it if it wasn't installed?
>>>>
>>>> 2016-04-30 18:04 GMT+01:00 Andy Riebs <[email protected]>:
>>>> For Slurm, after the "make install", did you do a "make install-contrib"
>>>> (which builds the pmi2 support)? I think you would have seen a runtime
>>>> error if you hadn't, but possibly not.
>>>>
>>>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>>>> First of all, thank you for the response.
>>>>>
>>>>> Here are the answers:
>>>>> I tried multiple commands:
>>>>> I started with "srun -N2 --mpi=pmi2 ptest" then I changed the
>>>>> slurm.conf's mpi parameter to pmi2 so I no longer need the option.
>>>>> I also tried a script submitted via sbatch. It doesn't work either and
>>>>> squeue shows that it's running. My program is just passing a number from
>>>>> node 1 to node 2 so it doesn't normally take that long.
>>>>> OpenMPI version is 1.10.2 / SLURM's is 15.08.8
>>>>> I built Slurm myself with no specific options. For OpenMPI I actually
>>>>> downloaded it from the CentOS 7 default repo. But I tried building the
>>>>> same version before with --with-slurm and --with-pmi options, yet it
>>>>> wasn't working either.
>>>>> I am attaching a copy of my slurm.conf file and the script I used to submit
>>>>> the job.
>>>>>
>>>>> The script:
>>>>>
>>>>> #!/bin/bash
>>>>> #
>>>>> #SBATCH --job-name=test
>>>>> #SBATCH --output=res_mpi.txt
>>>>> #
>>>>> #SBATCH -N 2
>>>>> module load openmpi
>>>>> mpirun test
>>>>>
>>>>> slurm.conf file:
>>>>>
>>>>>
>>>>> # slurm.conf file generated by configurator easy.html.
>>>>> # Put this file on all nodes of your cluster.
>>>>> # See the slurm.conf man page for more information.
>>>>> #
>>>>> ControlMachine=m
>>>>> ControlAddr=m
>>>>> BackupController=mb
>>>>> BackupAddr=mb
>>>>> #
>>>>> #MailProg=/bin/mail
>>>>> MpiDefault=pmi2
>>>>> MpiParams=ports=12000-12999
>>>>> ProctrackType=proctrack/linuxproc
>>>>> ReturnToService=2
>>>>> #SlurmctldPidFile=/var/run/slurmctld.pid
>>>>> #SlurmctldPort=6817
>>>>> #SlurmdPidFile=/var/run/slurmd.pid
>>>>> #SlurmdPort=6818
>>>>> SlurmdSpoolDir=/var/spool/slurm/slurmd
>>>>> SlurmUser=slurm
>>>>> #SlurmdUser=root
>>>>> #StateSaveLocation=/var/spool/slurm
>>>>> StateSaveLocation=/mnt/data/spool/slurm
>>>>> SwitchType=switch/none
>>>>> TaskPlugin=task/none
>>>>> #
>>>>> #
>>>>> # TIMERS
>>>>> #KillWait=30
>>>>> #MinJobAge=300
>>>>> #SlurmctldTimeout=120
>>>>> #SlurmdTimeout=300
>>>>> #
>>>>> #
>>>>> # SCHEDULING
>>>>> FastSchedule=1
>>>>> SchedulerType=sched/backfill
>>>>> #SchedulerPort=7321
>>>>> SelectType=select/linear
>>>>> PreemptType=preempt/partition_prio
>>>>> PreemptMode=requeue
>>>>> #
>>>>> #
>>>>> # LOGGING AND ACCOUNTING
>>>>> AccountingStorageType=accounting_storage/slurmdbd
>>>>> #JobAcctGatherFrequency=30
>>>>> JobAcctGatherType=jobacct_gather/linux
>>>>> JobCompType=jobcomp/none
>>>>> #SlurmctldDebug=3
>>>>> #SlurmctldLogFile=/var/log/slurmctld.log
>>>>> SlurmctldLogFile=/mnt/data/log/slurmctld.log
>>>>> #SlurmdDebug=3
>>>>> SlurmdLogFile=/var/log/slurmd.log
>>>>> AccountingStorageBackupHost=mb
>>>>> #
>>>>> #
>>>>> # COMPUTE NODES
>>>>> NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
>>>>> NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
>>>>>
>>>>> 2016-04-30 16:40 GMT+01:00 Andy Riebs <[email protected]>:
>>>>> Hi,
>>>>>
>>>>> The one problem that I see in your description is minor, and probably not
>>>>> significant: the MPI ports parameter was needed for very old versions of
>>>>> Open MPI, IIRC.
>>>>>
>>>>> To help debug your problems, please respond to this list with
>>>>> What command did you use to invoke your program?
>>>>> What versions of Slurm and OpenMPI are you using?
>>>>> Did you build them yourself, or use prebuilt versions?
>>>>> If you built them yourself, what configuration options did you use?
>>>>> If pre-built versions, where did you get them?
>>>>> A copy of your slurm.conf file (you may want to change node names and
>>>>> other potentially sensitive information)
>>>>> Andy
>>>>>
>>>>> On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
>>>>>> Hello everyone,
>>>>>>
>>>>>> I've set up a basic configuration using slurm with a
>>>>>> master node, backup node, a login node and eight compute nodes.
>>>>>>
>>>>>> Everything in slurm is working fine. I can issue
>>>>>> jobs and see the state of the eight nodes as Idle. The problem is with
>>>>>> OpenMPI. The hello parallel program where each process prints its rank
>>>>>> among the global set is working, but when I try to establish
>>>>>> communications between nodes through MPI_Send and MPI_Recv, it just
>>>>>> hangs there indefinitely.
>>>>>>
>>>>>> I'm using CentOS 7; firewalld and SELinux are disabled. If I launch my
>>>>>> parallel program, ptest, on 2 nodes, [n1, n2], a little check with lsof
>>>>>> -i shows that ptest is listening on port 1024 on both nodes, which I
>>>>>> find weird since only one should be listening. Moreover, I've set the slurm
>>>>>> MPI parameters to pmi2 and the allowed ports to [12000-12999], so why is it
>>>>>> still using port 1024?
>>>>>>
>>>>>> I hope you can help me with this problem. I can't see what's
>>>>>> wrong.
>>>>>> Thank you in advance.
>>>>>>
>>>>>> M. Acheli.
