Your slurm-OMPI integration is clearly broken - the processes do not realize they are operating in a common world. Does it work if you use mpirun instead of srun?
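
One rough way to see what each launched process is being told about its "world" is to dump the rank/size variables the launchers export. The sketch below is only an illustration: it assumes srun exports SLURM_PROCID and SLURM_NTASKS, that Slurm's pmi2 plugin exports PMI_RANK and PMI_SIZE, and that mpirun exports OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE. Please verify those names against your Slurm and Open MPI versions. The helper names (launch_report, sizes_agree) are made up for this example.

```python
import os
import socket

# Rank/size variable pairs assumed to be exported by each launcher.
# These names are assumptions to verify against your installed versions.
LAUNCH_VARS = {
    "slurm": ("SLURM_PROCID", "SLURM_NTASKS"),
    "pmi": ("PMI_RANK", "PMI_SIZE"),
    "ompi": ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE"),
}


def launch_report(env):
    """Return {source: (rank, size)} for each source whose two variables are both set."""
    report = {}
    for source, (rank_var, size_var) in LAUNCH_VARS.items():
        if rank_var in env and size_var in env:
            report[source] = (int(env[rank_var]), int(env[size_var]))
    return report


def sizes_agree(report):
    """True when at least one source is visible and all visible sources report the same size."""
    sizes = {size for _, size in report.values()}
    return len(sizes) == 1


if __name__ == "__main__":
    # Each launched task prints one line describing what it can see.
    report = launch_report(os.environ)
    print(f"{socket.gethostname()}: {report} consistent={sizes_agree(report)}")
```

Saved under a hypothetical name such as launch_check.py and started with "srun -N2 python3 launch_check.py", every task should print one line. If Slurm reports two tasks but no PMI variables show up, or the sizes disagree, that would point at the launcher side rather than at your MPI_Send/MPI_Recv code.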
> On Apr 30, 2016, at 11:28 AM, Mehdi Acheli <[email protected]> wrote:
>
> No, I just tested another program and it seems that the world_size is reduced
> to one even though I launch the job on two nodes. The hello program is doing
> the same. Well, I am completely lost now.
>
> <Capture.PNG> <Capture1.PNG>
>
> 2016-04-30 19:09 GMT+01:00 Ralph Castain <[email protected]>:
> This looks like a bug in your program - you specified an invalid rank when
> attempting to send.
>
>> On Apr 30, 2016, at 10:59 AM, Mehdi Acheli <[email protected]> wrote:
>>
>> I just did. Permit me to include a capture of the script output file:
>>
>> <Capture.PNG>
>>
>> I specify in my script the option "-N 2", but it looks like the world_size
>> is composed of only one process and both nodes are trying to execute an
>> MPI_Send!
>>
>> 2016-04-30 18:41 GMT+01:00 Andy Riebs <[email protected]>:
>> Aha! I missed it the first time... In your script, replace "mpirun" with
>> "srun" and the world should be better.
>>
>> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>>> Uh, I did a "make all install" so I think pmi support is installed. And
>>> the hello world program is working; would it if it wasn't installed?
>>>
>>> 2016-04-30 18:04 GMT+01:00 Andy Riebs <[email protected]>:
>>> For Slurm, after the "make install", did you do a "make install-contrib"
>>> (which builds the pmi2 support)? I think you would have seen a runtime
>>> error if you hadn't, but possibly not.
>>>
>>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>>> First of all, thank you for the reaction.
>>>>
>>>> Here are the answers:
>>>> - I tried multiple commands:
>>>>   I started with "srun -N2 --mpi=pmi2 ptest" then I changed the
>>>>   slurm.conf's mpi parameter to pmi2 so I no longer need the option.
>>>>   I also tried a script submitted via sbatch. It doesn't work either and
>>>>   squeue shows that it's running. My program is just passing a number from
>>>>   node 1 to node 2 so it doesn't normally take that long.
>>>> - OpenMPI version is 1.10.2 / SLURM's is 15.08.8
>>>> - I built Slurm myself with no specific options. For OpenMPI I actually
>>>>   downloaded it from the CentOS 7 default repo. But I tried building the
>>>>   same version before with --with-slurm and --with-pmi options, yet it
>>>>   wasn't working either.
>>>> - I am attaching a copy of my slurm.conf file and the script I used to
>>>>   submit the job.
>>>>
>>>> The script:
>>>>
>>>> #!/bin/bash
>>>> #
>>>> #SBATCH --job-name=test
>>>> #SBATCH --output=res_mpi.txt
>>>> #
>>>> #SBATCH -N 2
>>>> module load openmpi
>>>> mpirun test
>>>>
>>>> Slurm.conf file:
>>>>
>>>> # slurm.conf file generated by configurator easy.html.
>>>> # Put this file on all nodes of your cluster.
>>>> # See the slurm.conf man page for more information.
>>>> #
>>>> ControlMachine=m
>>>> ControlAddr=m
>>>> BackupController=mb
>>>> BackupAddr=mb
>>>> #
>>>> #MailProg=/bin/mail
>>>> MpiDefault=pmi2
>>>> MpiParams=ports=12000-12999
>>>> ProctrackType=proctrack/linuxproc
>>>> ReturnToService=2
>>>> #SlurmctldPidFile=/var/run/slurmctld.pid
>>>> #SlurmctldPort=6817
>>>> #SlurmdPidFile=/var/run/slurmd.pid
>>>> #SlurmdPort=6818
>>>> SlurmdSpoolDir=/var/spool/slurm/slurmd
>>>> SlurmUser=slurm
>>>> #SlurmdUser=root
>>>> #StateSaveLocation=/var/spool/slurm
>>>> StateSaveLocation=/mnt/data/spool/slurm
>>>> SwitchType=switch/none
>>>> TaskPlugin=task/none
>>>> #
>>>> #
>>>> # TIMERS
>>>> #KillWait=30
>>>> #MinJobAge=300
>>>> #SlurmctldTimeout=120
>>>> #SlurmdTimeout=300
>>>> #
>>>> #
>>>> # SCHEDULING
>>>> FastSchedule=1
>>>> SchedulerType=sched/backfill
>>>> #SchedulerPort=7321
>>>> SelectType=select/linear
>>>> PreemptType=preempt/partition_prio
>>>> PreemptMode=requeue
>>>> #
>>>> #
>>>> # LOGGING AND ACCOUNTING
>>>> AccountingStorageType=accounting_storage/slurmdbd
>>>> #JobAcctGatherFrequency=30
>>>> JobAcctGatherType=jobacct_gather/linux
>>>> JobCompType=jobcomp/none
>>>> #SlurmctldDebug=3
>>>> #SlurmctldLogFile=/var/log/slurmctld.log
>>>> SlurmctldLogFile=/mnt/data/log/slurmctld.log
>>>> #SlurmdDebug=3
>>>> SlurmdLogFile=/var/log/slurmd.log
>>>> AccountingStorageBackupHost=mb
>>>> #
>>>> #
>>>> # COMPUTE NODES
>>>> NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
>>>> NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
>>>>
>>>> 2016-04-30 16:40 GMT+01:00 Andy Riebs <[email protected]>:
>>>> Hi,
>>>>
>>>> The one problem that I see in your description is minor, and probably not
>>>> significant: the MPI ports parameter was needed for very old versions of
>>>> Open MPI, IIRC.
>>>>
>>>> To help debug your problems, please respond to this list with
>>>> - What command did you use to invoke your program?
>>>> - What versions of Slurm and OpenMPI are you using?
>>>> - Did you build them yourself, or use prebuilt versions?
>>>> - If you built them yourself, what configuration options did you use?
>>>> - If pre-built versions, where did you get them?
>>>> - A copy of your slurm.conf file (you may want to change node names and
>>>>   other potentially sensitive information)
>>>> Andy
>>>>
>>>> On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
>>>>> Hello everyone,
>>>>>
>>>>> I've set up a basic configuration using slurm with a master node, a
>>>>> backup node, a login node and eight compute nodes.
>>>>>
>>>>> Everything in slurm is working fine. I can issue jobs and see the state
>>>>> of the eight nodes as Idle. The problem is with OpenMPI. The hello
>>>>> parallel program where each process prints its rank among the global set
>>>>> is working, but when I try to establish communications between nodes
>>>>> through MPI_Send and MPI_Recv, it just hangs there indefinitely.
>>>>>
>>>>> I'm using CentOS 7; firewalld and SELinux are disabled. If I launch my
>>>>> parallel program, ptest, on 2 nodes: [n1, n2], a little check with lsof
>>>>> -i shows that ptest is listening on port 1024 on both nodes, which I find
>>>>> weird since only one should be listening. Moreover, I've set slurm Mpi
>>>>> parameters on pmi2 and ports allowed on [12000-12999], so why is it still
>>>>> using port 1024?
>>>>>
>>>>> I hope you can help me with this problem. I can't see what's wrong.
>>>>> Thank you in advance.
>>>>>
>>>>> M. Acheli.
