Yes, if I use "salloc -N2 sh" and then launch the job via mpirun, the hello world program works correctly. However, my original program still blocks on the send and receive calls.
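
One way to see what each launcher tells the processes is a small check like the sketch below. It is only an illustration and not something taken from this thread: the names launcher_view and check_world are made up, and the world_size argument is assumed to come from the program itself (for example the value returned by MPI_Comm_size). It relies on srun exporting SLURM_NTASKS and SLURM_PROCID to each task, and on Open MPI's mpirun exporting OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_RANK.

```python
import os

# Environment variables that the launchers export to each task.
# SLURM_NTASKS / SLURM_PROCID are set by srun; OMPI_COMM_WORLD_SIZE /
# OMPI_COMM_WORLD_RANK are set by Open MPI's mpirun.
LAUNCHER_VARS = (
    "SLURM_NTASKS",
    "SLURM_PROCID",
    "OMPI_COMM_WORLD_SIZE",
    "OMPI_COMM_WORLD_RANK",
)


def launcher_view(env=None):
    """Return what the launcher told this process (None for unset variables)."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in LAUNCHER_VARS}


def check_world(world_size, env=None):
    """Compare the MPI world size with the task count announced by srun.

    world_size is a hypothetical input, e.g. the result of MPI_Comm_size.
    Returns a list of warning strings (empty when nothing looks wrong).
    """
    view = launcher_view(env)
    warnings = []
    if world_size < 2:
        warnings.append(
            "world size is %d: there is no peer rank to send to" % world_size
        )
    ntasks = view["SLURM_NTASKS"]
    if ntasks is not None and int(ntasks) != world_size:
        warnings.append(
            "srun started %s tasks but MPI sees %d" % (ntasks, world_size)
        )
    return warnings


if __name__ == "__main__":
    # Without MPI bindings at hand, only print the launcher's view.
    for name, value in launcher_view().items():
        print("%s=%s" % (name, value))
```

Running this once under srun and once under mpirun inside the same allocation shows which variables each launcher sets. If srun reports two tasks while the program sees a world size of one, the problem is more likely in the Slurm/Open MPI integration than in the send and receive code.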

2016-04-30 19:47 GMT+01:00 Ralph Castain <[email protected]>:

> Your slurm-OMPI integration is clearly broken - the processes do not
> realize they are operating in a common world. Does it work if you use
> mpirun instead of srun?
>
> On Apr 30, 2016, at 11:28 AM, Mehdi Acheli <[email protected]> wrote:
>
> No, I just tested another program and it seems that the world_size is
> reduced to one even though i launch the job on two nodes. The hello program
> is doing the same. Well, I am completely lost now.
>
> <Capture.PNG>
>
> <Capture.PNG><Capture1.PNG>
>
> 2016-04-30 19:09 GMT+01:00 Ralph Castain <[email protected]>:
>
>> This looks like a bug in your program - you specified an invalid rank
>> when attempting to send.
>>
>> On Apr 30, 2016, at 10:59 AM, Mehdi Acheli <[email protected]> wrote:
>>
>> I just did. Permit me to include a capture of the script output file:
>>
>> <Capture.PNG>
>>
>> I specify in my script the option "-N 2", but it looks like the
>> world_size is composed of only one process and both nodes are trying to
>> execute an MPI_Send !
>>
>> 2016-04-30 18:41 GMT+01:00 Andy Riebs <[email protected]>:
>>
>>> Aha! I missed it the first time... In your script, replace "mpirun" with
>>> "srun" and the world should be better.
>>>
>>> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>>>
>>> Euh, I did a "make all install" so I think pmi support is installed. And
>>> the hello world program is working, would it if it wasn't installed ?
>>>
>>> 2016-04-30 18:04 GMT+01:00 Andy Riebs <[email protected]>:
>>>
>>>> For Slurm, after the "make install", did you do a "make
>>>> install-contrib" (which builds the pmi2 support)? I think you would have
>>>> seen a runtime error if you hadn't, but possibly not.
>>>>
>>>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>>>
>>>> First of all, thank you for the reaction.
>>>>
>>>> Here are the answers :
>>>>
>>>> 1. I tried multiple commands:
>>>>    1. I started with "srun -N2 --mpi=pmi2 ptest" then I changed the
>>>>       slurm.conf's mpi parameter to pmi2 so I no longer need the option.
>>>>    2. I also tried a script submitted via sbatch. It doesn't work
>>>>       either and squeue shows that it's running. My program is just
>>>>       passing a number from node 1 to node 2 so it doesn't normally
>>>>       take that long.
>>>> 2. OpenMPI version is 1.10.2 / SLURM's is 15.08.8
>>>> 3. I built Slurm myself with no specific options. For OpenMPI I
>>>>    actually downloaded it from the CentOS 7 default repo. But I tried
>>>>    building the same version before with --with-slurm and --with-pmi
>>>>    options, yet it wasn't working either.
>>>>
>>>> I am joining a copy of my slurm.conf file and the script I used to
>>>> submit the job.
>>>>
>>>> The script :
>>>>
>>>>> #!/bin/bash
>>>>> #
>>>>> #SBATCH --job-name=test
>>>>> #SBATCH --output=res_mpi.txt
>>>>> #
>>>>> #SBATCH -N 2
>>>>> module load openmpi
>>>>> mpirun test
>>>>
>>>> Slurm.conf file :
>>>>
>>>>> # slurm.conf file generated by configurator easy.html.
>>>>> # Put this file on all nodes of your cluster.
>>>>> # See the slurm.conf man page for more information.
>>>>> #
>>>>> ControlMachine=m
>>>>> ControlAddr=m
>>>>> BackupController=mb
>>>>> BackupAddr=mb
>>>>> #
>>>>> #MailProg=/bin/mail
>>>>> MpiDefault=pmi2
>>>>> MpiParams=ports=12000-12999
>>>>> ProctrackType=proctrack/linuxproc
>>>>> ReturnToService=2
>>>>> #SlurmctldPidFile=/var/run/slurmctld.pid
>>>>> #SlurmctldPort=6817
>>>>> #SlurmdPidFile=/var/run/slurmd.pid
>>>>> #SlurmdPort=6818
>>>>> SlurmdSpoolDir=/var/spool/slurm/slurmd
>>>>> SlurmUser=slurm
>>>>> #SlurmdUser=root
>>>>> #StateSaveLocation=/var/spool/slurm
>>>>> StateSaveLocation=/mnt/data/spool/slurm
>>>>> SwitchType=switch/none
>>>>> TaskPlugin=task/none
>>>>> #
>>>>> #
>>>>> # TIMERS
>>>>> #KillWait=30
>>>>> #MinJobAge=300
>>>>> #SlurmctldTimeout=120
>>>>> #SlurmdTimeout=300
>>>>> #
>>>>> #
>>>>> # SCHEDULING
>>>>> FastSchedule=1
>>>>> SchedulerType=sched/backfill
>>>>> #SchedulerPort=7321
>>>>> SelectType=select/linear
>>>>> PreemptType=preempt/partition_prio
>>>>> PreemptMode=requeue
>>>>> #
>>>>> #
>>>>> # LOGGING AND ACCOUNTING
>>>>> AccountingStorageType=accounting_storage/slurmdbd
>>>>> #JobAcctGatherFrequency=30
>>>>> JobAcctGatherType=jobacct_gather/linux
>>>>> JobCompType=jobcomp/none
>>>>> #SlurmctldDebug=3
>>>>> #SlurmctldLogFile=/var/log/slurmctld.log
>>>>> SlurmctldLogFile=/mnt/data/log/slurmctld.log
>>>>> #SlurmdDebug=3
>>>>> SlurmdLogFile=/var/log/slurmd.log
>>>>> AccountingStorageBackupHost=mb
>>>>> #
>>>>> #
>>>>> # COMPUTE NODES
>>>>> NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
>>>>> NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
>>>>
>>>> 2016-04-30 16:40 GMT+01:00 Andy Riebs <[email protected]>:
>>>>
>>>>> Hi,
>>>>>
>>>>> The one problem that I see in your description is minor, and probably
>>>>> not significant: the MPI ports parameter was needed for very old versions
>>>>> of Open MPI, IIRC.
>>>>>
>>>>> To help debug your problems, please respond to this list with
>>>>>
>>>>> 1. What command did you use to invoke your program?
>>>>> 2. What versions of Slurm and OpenMPI are you using?
>>>>> 3. Did you build them yourself, or use prebuilt versions?
>>>>>    - If you built them yourself, what configuration options did you
>>>>>      use?
>>>>>    - If pre-built versions, where did you get them?
>>>>> 4. A copy of your slurm.conf file (you may want to change node
>>>>>    names and other potentially sensitive information)
>>>>>
>>>>> Andy
>>>>>
>>>>> On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
>>>>>
>>>>> Hello everyone,
>>>>>
>>>>> I've set a basic configuration using slurm with a master node, backup
>>>>> node, a login node and eight compute node.
>>>>>
>>>>> Everything in slurm is working fine. I can issue jobs and see the state
>>>>> of the eight nodes as Idle. The problem is with OpenMPI. The hello
>>>>> parallel program where each process prints its rank among the global set
>>>>> is working but when i try to establish communications between nodes
>>>>> through MPI_Send and MPI_Recv, it just hangs there undefinitely.
>>>>>
>>>>> I'm using CentOS 7, firewalld and SElinux are disabled. If i launch my
>>>>> parallel program, ptest, on 2 nodes : [n1, n2], a little check with
>>>>> lsof -i shows that ptest is listening on port 1024 on both nodes, which
>>>>> i find weird since only one should be listening. Moreover, i've set
>>>>> slurm Mpi parameters on pmi2 and ports allowed on [12000-12999], so why
>>>>> is it still using port 1024 ?
>>>>>
>>>>> I hope u can help me with this problem. I can't see what's wrong.
>>>>> Thank you in advance.
>>>>>
>>>>> M. Acheli.
