This looks like a bug in your program - you specified an invalid rank when attempting to send.
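
As an illustration only, here is a minimal sketch of a guard that would surface this kind of error early. It assumes a C program that passes one integer from rank 0 to rank 1, as described further down the thread. The helper `peer_rank` and the `USE_MPI` build flag are hypothetical names introduced for this sketch, not taken from the original program.

```c
#include <stdio.h>

#ifdef USE_MPI
#include <mpi.h>
#endif

/* Hypothetical helper (not from the original program): for a two-process
 * exchange, return the rank that `rank` should talk to, or -1 when there is
 * no valid peer (world too small, or this rank takes no part). */
int peer_rank(int rank, int world_size)
{
    if (world_size < 2 || rank < 0 || rank > 1)
        return -1;
    return 1 - rank;
}

#ifdef USE_MPI
/* Compiled only when USE_MPI is defined (an assumption of this sketch). */
int main(int argc, char **argv)
{
    int rank, size, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int peer = peer_rank(rank, size);

    /* If every process was started as its own one-process world, size is 1
     * and rank 1 does not exist, so sending to it is an invalid-rank error. */
    if (size < 2) {
        fprintf(stderr, "rank %d: world size is %d, need at least 2\n",
                rank, size);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank %d\n", value, peer);
    }

    MPI_Finalize();
    return 0;
}
#endif
```

Built with something like `mpicc -DUSE_MPI ptest.c -o ptest`, a run in which each process sees a world size of 1 would stop with a clear message instead of attempting a send to a rank that does not exist.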
> On Apr 30, 2016, at 10:59 AM, Mehdi Acheli <[email protected]> wrote:
>
> I just did. Permit me to include a capture of the script output file:
>
> <Capture.PNG>
>
> I specify the option "-N 2" in my script, but it looks like the world_size is
> composed of only one process and both nodes are trying to execute an MPI_Send!
>
> 2016-04-30 18:41 GMT+01:00 Andy Riebs <[email protected]>:
> Aha! I missed it the first time... In your script, replace "mpirun" with
> "srun" and the world should be better.
>
> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>> Uh, I did a "make all install", so I think PMI support is installed. And the
>> hello world program is working; would it be if PMI weren't installed?
>>
>> 2016-04-30 18:04 GMT+01:00 Andy Riebs <[email protected]>:
>> For Slurm, after the "make install", did you do a "make install-contrib"
>> (which builds the pmi2 support)? I think you would have seen a runtime error
>> if you hadn't, but possibly not.
>>
>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>> First of all, thank you for the reaction.
>>>
>>> Here are the answers:
>>> - I tried multiple commands. I started with "srun -N2 --mpi=pmi2 ptest",
>>>   then I changed the mpi parameter in slurm.conf to pmi2 so I no longer
>>>   need the option.
>>> - I also tried a script submitted via sbatch. It doesn't work either, and
>>>   squeue shows that it's running. My program just passes a number from
>>>   node 1 to node 2, so it shouldn't normally take that long.
>>> - The OpenMPI version is 1.10.2; Slurm's is 15.08.8.
>>> - I built Slurm myself with no specific options. For OpenMPI, I actually
>>>   downloaded it from the CentOS 7 default repo. But I had tried building
>>>   the same version before with the --with-slurm and --with-pmi options,
>>>   and it wasn't working either.
>>> - I am attaching a copy of my slurm.conf file and the script I used to
>>>   submit the job.
>>>
>>> The script:
>>>
>>> #!/bin/bash
>>> #
>>> #SBATCH --job-name=test
>>> #SBATCH --output=res_mpi.txt
>>> #
>>> #SBATCH -N 2
>>> module load openmpi
>>> mpirun test
>>>
>>> Slurm.conf file:
>>>
>>> # slurm.conf file generated by configurator easy.html.
>>> # Put this file on all nodes of your cluster.
>>> # See the slurm.conf man page for more information.
>>> #
>>> ControlMachine=m
>>> ControlAddr=m
>>> BackupController=mb
>>> BackupAddr=mb
>>> #
>>> #MailProg=/bin/mail
>>> MpiDefault=pmi2
>>> MpiParams=ports=12000-12999
>>> ProctrackType=proctrack/linuxproc
>>> ReturnToService=2
>>> #SlurmctldPidFile=/var/run/slurmctld.pid
>>> #SlurmctldPort=6817
>>> #SlurmdPidFile=/var/run/slurmd.pid
>>> #SlurmdPort=6818
>>> SlurmdSpoolDir=/var/spool/slurm/slurmd
>>> SlurmUser=slurm
>>> #SlurmdUser=root
>>> #StateSaveLocation=/var/spool/slurm
>>> StateSaveLocation=/mnt/data/spool/slurm
>>> SwitchType=switch/none
>>> TaskPlugin=task/none
>>> #
>>> #
>>> # TIMERS
>>> #KillWait=30
>>> #MinJobAge=300
>>> #SlurmctldTimeout=120
>>> #SlurmdTimeout=300
>>> #
>>> #
>>> # SCHEDULING
>>> FastSchedule=1
>>> SchedulerType=sched/backfill
>>> #SchedulerPort=7321
>>> SelectType=select/linear
>>> PreemptType=preempt/partition_prio
>>> PreemptMode=requeue
>>> #
>>> #
>>> # LOGGING AND ACCOUNTING
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> #JobAcctGatherFrequency=30
>>> JobAcctGatherType=jobacct_gather/linux
>>> JobCompType=jobcomp/none
>>> #SlurmctldDebug=3
>>> #SlurmctldLogFile=/var/log/slurmctld.log
>>> SlurmctldLogFile=/mnt/data/log/slurmctld.log
>>> #SlurmdDebug=3
>>> SlurmdLogFile=/var/log/slurmd.log
>>> AccountingStorageBackupHost=mb
>>> #
>>> #
>>> # COMPUTE NODES
>>> NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
>>> NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
>>>
>>> 2016-04-30 16:40 GMT+01:00 Andy Riebs <[email protected]>:
>>> Hi,
>>>
>>> The one problem that I see in your description is minor, and probably not
>>> significant: the MPI ports parameter was needed for very old versions of
>>> Open MPI, IIRC.
>>>
>>> To help debug your problems, please respond to this list with:
>>> - What command did you use to invoke your program?
>>> - What versions of Slurm and OpenMPI are you using?
>>> - Did you build them yourself, or use prebuilt versions?
>>> - If you built them yourself, what configuration options did you use?
>>> - If pre-built versions, where did you get them?
>>> - A copy of your slurm.conf file (you may want to change node names and
>>>   other potentially sensitive information)
>>>
>>> Andy
>>>
>>> On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
>>>> Hello everyone,
>>>>
>>>> I've set up a basic configuration using Slurm with a master node, a
>>>> backup node, a login node and eight compute nodes.
>>>>
>>>> Everything in Slurm is working fine. I can issue jobs and see the state
>>>> of the eight nodes as Idle. The problem is with OpenMPI. The parallel
>>>> hello program, where each process prints its rank among the global set,
>>>> is working, but when I try to establish communications between nodes
>>>> through MPI_Send and MPI_Recv, it just hangs there indefinitely.
>>>>
>>>> I'm using CentOS 7; firewalld and SELinux are disabled. If I launch my
>>>> parallel program, ptest, on 2 nodes [n1, n2], a little check with
>>>> lsof -i shows that ptest is listening on port 1024 on both nodes, which
>>>> I find weird since only one should be listening. Moreover, I've set the
>>>> Slurm MPI parameters to pmi2 and the allowed ports to [12000-12999], so
>>>> why is it still using port 1024?
>>>>
>>>> I hope you can help me with this problem. I can't see what's wrong.
>>>> Thank you in advance.
>>>>
>>>> M. Acheli.
