I just did. Here is a capture of the script's output file: [image: Inline image 1]
I specify the option "-N 2" in my script, but it looks like world_size
consists of only one process, and both nodes are trying to execute an
MPI_Send!

2016-04-30 18:41 GMT+01:00 Andy Riebs <[email protected]>:

> Aha! I missed it the first time... In your script, replace "mpirun" with
> "srun" and the world should be better.
>
> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>
> Uh, I did a "make all install", so I think PMI support is installed. And
> the hello world program is working; would it work if PMI weren't installed?
>
> 2016-04-30 18:04 GMT+01:00 Andy Riebs <[email protected]>:
>
>> For Slurm, after the "make install", did you do a "make install-contrib"
>> (which builds the pmi2 support)? I think you would have seen a runtime
>> error if you hadn't, but possibly not.
>>
>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>
>> First of all, thank you for the reply.
>>
>> Here are the answers:
>>
>> 1. I tried multiple commands:
>>    1. I started with "srun -N2 --mpi=pmi2 ptest", then I changed the
>>       mpi parameter in slurm.conf to pmi2 so I no longer need the option.
>>    2. I also tried a script submitted via sbatch. It doesn't work
>>       either, and squeue shows that it's running. My program just passes
>>       a number from node 1 to node 2, so it shouldn't normally take
>>       that long.
>> 2. The OpenMPI version is 1.10.2; Slurm's is 15.08.8.
>> 3. I built Slurm myself with no specific options. For OpenMPI, I
>>    actually installed it from the CentOS 7 default repo. But I had tried
>>    building the same version before with the --with-slurm and --with-pmi
>>    options, and it wasn't working either.
>>
>> I am attaching a copy of my slurm.conf file and the script I used to
>> submit the job.
>>
>> The script:
>>
>>> #!/bin/bash
>>> #
>>> #SBATCH --job-name=test
>>> #SBATCH --output=res_mpi.txt
>>> #
>>> #SBATCH -N 2
>>> module load openmpi
>>> mpirun test
>>
>> Slurm.conf file:
>>
>>> # slurm.conf file generated by configurator easy.html.
>>> # Put this file on all nodes of your cluster.
>>> # See the slurm.conf man page for more information.
>>> #
>>> ControlMachine=m
>>> ControlAddr=m
>>> BackupController=mb
>>> BackupAddr=mb
>>> #
>>> #MailProg=/bin/mail
>>> MpiDefault=pmi2
>>> MpiParams=ports=12000-12999
>>> ProctrackType=proctrack/linuxproc
>>> ReturnToService=2
>>> #SlurmctldPidFile=/var/run/slurmctld.pid
>>> #SlurmctldPort=6817
>>> #SlurmdPidFile=/var/run/slurmd.pid
>>> #SlurmdPort=6818
>>> SlurmdSpoolDir=/var/spool/slurm/slurmd
>>> SlurmUser=slurm
>>> #SlurmdUser=root
>>> #StateSaveLocation=/var/spool/slurm
>>> StateSaveLocation=/mnt/data/spool/slurm
>>> SwitchType=switch/none
>>> TaskPlugin=task/none
>>> #
>>> #
>>> # TIMERS
>>> #KillWait=30
>>> #MinJobAge=300
>>> #SlurmctldTimeout=120
>>> #SlurmdTimeout=300
>>> #
>>> #
>>> # SCHEDULING
>>> FastSchedule=1
>>> SchedulerType=sched/backfill
>>> #SchedulerPort=7321
>>> SelectType=select/linear
>>> PreemptType=preempt/partition_prio
>>> PreemptMode=requeue
>>> #
>>> #
>>> # LOGGING AND ACCOUNTING
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> #JobAcctGatherFrequency=30
>>> JobAcctGatherType=jobacct_gather/linux
>>> JobCompType=jobcomp/none
>>> #SlurmctldDebug=3
>>> #SlurmctldLogFile=/var/log/slurmctld.log
>>> SlurmctldLogFile=/mnt/data/log/slurmctld.log
>>> #SlurmdDebug=3
>>> SlurmdLogFile=/var/log/slurmd.log
>>> AccountingStorageBackupHost=mb
>>> #
>>> #
>>> # COMPUTE NODES
>>> NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
>>> NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
>>
>> 2016-04-30 16:40 GMT+01:00 Andy Riebs <[email protected]>:
>>
>>> Hi,
>>>
>>> The one problem that I see in your description is minor, and probably
>>> not significant: the MPI ports parameter was needed for very old
>>> versions of Open MPI, IIRC.
>>>
>>> To help debug your problems, please respond to this list with
>>>
>>> 1. What command did you use to invoke your program?
>>> 2. What versions of Slurm and OpenMPI are you using?
>>> 3. Did you build them yourself, or use prebuilt versions?
>>>    - If you built them yourself, what configuration options did you use?
>>>    - If pre-built versions, where did you get them?
>>> 4. A copy of your slurm.conf file (you may want to change node names
>>>    and other potentially sensitive information)
>>>
>>> Andy
>>>
>>> On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
>>>
>>> Hello everyone,
>>>
>>> I've set up a basic Slurm configuration with a master node, a backup
>>> node, a login node and eight compute nodes.
>>>
>>> Everything in Slurm is working fine. I can submit jobs and see the
>>> state of the eight nodes as Idle. The problem is with OpenMPI. The
>>> parallel hello program, where each process prints its rank among the
>>> global set, is working, but when I try to establish communication
>>> between nodes through MPI_Send and MPI_Recv, it just hangs there
>>> indefinitely.
>>>
>>> I'm using CentOS 7; firewalld and SELinux are disabled. If I launch my
>>> parallel program, ptest, on 2 nodes [n1, n2], a quick check with
>>> lsof -i shows that ptest is listening on port 1024 on both nodes, which
>>> I find weird since only one should be listening. Moreover, I've set
>>> Slurm's MPI parameters to pmi2 and the allowed ports to 12000-12999, so
>>> why is it still using port 1024?
>>>
>>> I hope you can help me with this problem. I can't see what's wrong.
>>> Thank you in advance.
>>>
>>> M. Acheli.
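
For reference, here is a minimal sketch of the batch script quoted above with Andy's suggested change applied (srun in place of mpirun). Several details are assumptions that do not come from the thread: the --ntasks-per-node directive, the explicit --mpi=pmi2 flag (redundant if MpiDefault=pmi2 is in effect), the ./ptest path, the launch_cmd helper, and the dry-run guard, which only exists so the script can be checked outside an allocation. The thread does not confirm that this change resolves the hang.

```bash
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=res_mpi.txt
#SBATCH -N 2
# Assumption: one MPI rank per node, so two ranks in total.
#SBATCH --ntasks-per-node=1

# Load Open MPI only where an environment-modules setup is available.
if type module >/dev/null 2>&1; then
    module load openmpi
fi

# Hypothetical helper: prints the launch line so it can be inspected.
launch_cmd() {
    # srun instead of mpirun, as suggested in the thread, so that Slurm
    # starts the ranks itself. "./ptest" is an assumed path; a bare "test"
    # may resolve to the system test utility instead of the MPI program.
    echo "srun --mpi=pmi2 ./ptest"
}

# SLURM_JOB_ID is set inside a Slurm job; elsewhere just show the command.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    $(launch_cmd)
else
    echo "dry run: $(launch_cmd)"
fi
```

Run directly on a login node, the script only prints the command it would launch; submitted with sbatch, it starts ptest through srun. If world_size still comes out as 1 on each node, that points at how Open MPI was built against Slurm's PMI rather than at the script.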
