No, I just tested another program, and it seems that the world_size is reduced to one even though I launch the job on two nodes. The hello program does the same. Well, I am completely lost now. [image: Inline image 1]
[image: Inline image 2] [image: Inline image 3]

2016-04-30 19:09 GMT+01:00 Ralph Castain <[email protected]>:

> This looks like a bug in your program - you specified an invalid rank when
> attempting to send.
>
> On Apr 30, 2016, at 10:59 AM, Mehdi Acheli <[email protected]> wrote:
>
> I just did. Permit me to include a capture of the script output file:
>
> <Capture.PNG>
>
> I specify in my script the option "-N 2", but it looks like the world_size
> is composed of only one process and both nodes are trying to execute an
> MPI_Send!
>
> 2016-04-30 18:41 GMT+01:00 Andy Riebs <[email protected]>:
>
>> Aha! I missed it the first time... In your script, replace "mpirun" with
>> "srun" and the world should be better.
>>
>> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>>
>> Uh, I did a "make all install", so I think PMI support is installed. And
>> the hello world program is working; would it work if PMI wasn't installed?
>>
>> 2016-04-30 18:04 GMT+01:00 Andy Riebs <[email protected]>:
>>
>>> For Slurm, after the "make install", did you do a "make install-contrib"
>>> (which builds the pmi2 support)? I think you would have seen a runtime
>>> error if you hadn't, but possibly not.
>>>
>>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>>
>>> First of all, thank you for the response.
>>>
>>> Here are the answers:
>>>
>>> 1. I tried multiple commands:
>>>    1. I started with "srun -N2 --mpi=pmi2 ptest", then I changed the
>>>       slurm.conf's mpi parameter to pmi2 so I no longer need the option.
>>>    2. I also tried a script submitted via sbatch. It doesn't work
>>>       either, and squeue shows that it's running. My program is just
>>>       passing a number from node 1 to node 2, so it shouldn't normally
>>>       take that long.
>>> 2. OpenMPI version is 1.10.2 / SLURM's is 15.08.8
>>> 3. I built Slurm myself with no specific options. For OpenMPI I
>>>    actually downloaded it from the CentOS 7 default repo. But I tried
>>>    building the same version before with the --with-slurm and --with-pmi
>>>    options, yet it wasn't working either.
>>>
>>> I am attaching a copy of my slurm.conf file and the script I used to
>>> submit the job.
>>>
>>> The script:
>>>
>>>> #!/bin/bash
>>>> #
>>>> #SBATCH --job-name=test
>>>> #SBATCH --output=res_mpi.txt
>>>> #
>>>> #SBATCH -N 2
>>>>
>>>> module load openmpi
>>>> mpirun test
>>>
>>> Slurm.conf file:
>>>
>>>> # slurm.conf file generated by configurator easy.html.
>>>> # Put this file on all nodes of your cluster.
>>>> # See the slurm.conf man page for more information.
>>>> #
>>>> ControlMachine=m
>>>> ControlAddr=m
>>>> BackupController=mb
>>>> BackupAddr=mb
>>>> #
>>>> #MailProg=/bin/mail
>>>> MpiDefault=pmi2
>>>> MpiParams=ports=12000-12999
>>>> ProctrackType=proctrack/linuxproc
>>>> ReturnToService=2
>>>> #SlurmctldPidFile=/var/run/slurmctld.pid
>>>> #SlurmctldPort=6817
>>>> #SlurmdPidFile=/var/run/slurmd.pid
>>>> #SlurmdPort=6818
>>>> SlurmdSpoolDir=/var/spool/slurm/slurmd
>>>> SlurmUser=slurm
>>>> #SlurmdUser=root
>>>> #StateSaveLocation=/var/spool/slurm
>>>> StateSaveLocation=/mnt/data/spool/slurm
>>>> SwitchType=switch/none
>>>> TaskPlugin=task/none
>>>> #
>>>> #
>>>> # TIMERS
>>>> #KillWait=30
>>>> #MinJobAge=300
>>>> #SlurmctldTimeout=120
>>>> #SlurmdTimeout=300
>>>> #
>>>> #
>>>> # SCHEDULING
>>>> FastSchedule=1
>>>> SchedulerType=sched/backfill
>>>> #SchedulerPort=7321
>>>> SelectType=select/linear
>>>> PreemptType=preempt/partition_prio
>>>> PreemptMode=requeue
>>>> #
>>>> #
>>>> # LOGGING AND ACCOUNTING
>>>> AccountingStorageType=accounting_storage/slurmdbd
>>>> #JobAcctGatherFrequency=30
>>>> JobAcctGatherType=jobacct_gather/linux
>>>> JobCompType=jobcomp/none
>>>> #SlurmctldDebug=3
>>>> #SlurmctldLogFile=/var/log/slurmctld.log
>>>> SlurmctldLogFile=/mnt/data/log/slurmctld.log
>>>> #SlurmdDebug=3
>>>> SlurmdLogFile=/var/log/slurmd.log
>>>> AccountingStorageBackupHost=mb
>>>> #
>>>> #
>>>> # COMPUTE NODES
>>>> NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
>>>> NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
>>>
>>> 2016-04-30 16:40 GMT+01:00 Andy Riebs <[email protected]>:
>>>
>>>> Hi,
>>>>
>>>> The one problem that I see in your description is minor, and probably
>>>> not significant: the MPI ports parameter was needed for very old versions
>>>> of Open MPI, IIRC.
>>>>
>>>> To help debug your problems, please respond to this list with
>>>>
>>>> 1. What command did you use to invoke your program?
>>>> 2. What versions of Slurm and OpenMPI are you using?
>>>> 3. Did you build them yourself, or use prebuilt versions?
>>>>    - If you built them yourself, what configuration options did you
>>>>      use?
>>>>    - If pre-built versions, where did you get them?
>>>> 4. A copy of your slurm.conf file (you may want to change node
>>>>    names and other potentially sensitive information)
>>>>
>>>> Andy
>>>>
>>>> On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
>>>>
>>>> Hello everyone,
>>>>
>>>> I've set up a basic configuration using Slurm with a master node, a
>>>> backup node, a login node and eight compute nodes.
>>>>
>>>> Everything in Slurm is working fine. I can issue jobs and see the state
>>>> of the eight nodes as Idle. The problem is with OpenMPI. The hello
>>>> parallel program, where each process prints its rank among the global
>>>> set, is working, but when I try to establish communications between
>>>> nodes through MPI_Send and MPI_Recv, it just hangs there indefinitely.
>>>>
>>>> I'm using CentOS 7; firewalld and SELinux are disabled. If I launch my
>>>> parallel program, ptest, on 2 nodes, [n1, n2], a quick check with
>>>> lsof -i shows that ptest is listening on port 1024 on both nodes, which
>>>> I find weird since only one should be listening. Moreover, I've set the
>>>> Slurm MPI parameters to pmi2 and the allowed ports to [12000-12999], so
>>>> why is it still using port 1024?
>>>>
>>>> I hope you can help me with this problem. I can't see what's wrong.
>>>> Thank you in advance.
>>>>
>>>> M. Acheli.
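
For reference, the suggestion quoted above (replace "mpirun" with "srun" in the batch script) could be sketched roughly as follows. This is only a sketch under assumptions: the helper have_nodes and the node-count check are hypothetical additions that do not appear in the thread, and "./ptest" is an assumed path to the program mentioned earlier. SLURM_JOB_ID and SLURM_JOB_NUM_NODES are environment variables that Slurm sets inside a batch job.

```shell
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res_mpi.txt
#
#SBATCH -N 2

# Hypothetical helper (not from the thread): succeeds only if the current
# allocation has at least the requested number of nodes.
have_nodes() {
    [ "${SLURM_JOB_NUM_NODES:-0}" -ge "$1" ]
}

# Launch only when running inside a Slurm job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    if ! have_nodes 2; then
        echo "expected 2 nodes, got ${SLURM_JOB_NUM_NODES:-0}" >&2
        exit 1
    fi
    module load openmpi
    # srun instead of mpirun, as suggested in the thread.
    # ./ptest is an assumed path to the MPI program.
    srun ./ptest
fi
```

If the job still reports a world size of one, printing the rank and size from inside the program on each node is a quick way to tell whether the tasks were actually launched as one MPI job.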
