This looks like a bug in your program - you specified an invalid rank when 
attempting to send.

> On Apr 30, 2016, at 10:59 AM, Mehdi Acheli <[email protected]> wrote:
> 
> I just did. Permit me to include a capture of the script output file: 
> 
> <Capture.PNG>
> 
> I specify the option "-N 2" in my script, but it looks like world_size is 
> only one process, and both nodes are trying to execute an MPI_Send!
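
One way to check what each launched process sees, before involving MPI at all, is to print the rank and world size that the launcher exports. The sketch below assumes the environment variables set by srun (SLURM_PROCID, SLURM_NTASKS) and by Open MPI's mpirun (OMPI_COMM_WORLD_RANK, OMPI_COMM_WORLD_SIZE). The function name report_world is made up for illustration.

```shell
#!/bin/bash
# Sketch: report the rank and world size that the launcher handed to this process.
# Assumption: srun exports SLURM_PROCID/SLURM_NTASKS and Open MPI's mpirun exports
# OMPI_COMM_WORLD_RANK/OMPI_COMM_WORLD_SIZE. With neither set, fall back to 0 and 1.
report_world() {
  local rank="${OMPI_COMM_WORLD_RANK:-${SLURM_PROCID:-0}}"
  local size="${OMPI_COMM_WORLD_SIZE:-${SLURM_NTASKS:-1}}"
  echo "host=${HOSTNAME:-unknown} rank=${rank} size=${size}"
}

report_world
```

Saved as an executable file (say report_world.sh, a hypothetical name) and started with "srun -N 2 ./report_world.sh", this would be expected to print one line per task. These values are only the launcher's view: if they show size=2 while the MPI program itself reports a world size of 1, that may point to the MPI library not picking up the PMI information from Slurm.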
> 
> 2016-04-30 18:41 GMT+01:00 Andy Riebs <[email protected]>:
> Aha! I missed it the first time... In your script, replace "mpirun" with 
> "srun" and the world should be better.
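
Applied to the batch script quoted further down, that change might look like the sketch below. It writes a revised script to job.sh (a file name chosen here for illustration). Assumptions not stated in the thread: the binary is ./ptest in the submission directory, one task per node is wanted, and Open MPI was built with PMI support so that srun can start the ranks. The --mpi=pmi2 option is redundant when MpiDefault=pmi2 is set in slurm.conf, but it makes the intent explicit.

```shell
#!/bin/bash
# Sketch: write a revised batch script to job.sh (hypothetical file name).
cat > job.sh <<'EOF'
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res_mpi.txt
#
#SBATCH -N 2
# Assumption: one MPI rank per node, since each node is configured with CPUs=1.
#SBATCH --ntasks-per-node=1
module load openmpi
# Assumption: the program is ./ptest; a bare name such as "test" is looked up on
# PATH and may resolve to /usr/bin/test instead.
srun --mpi=pmi2 ./ptest
EOF

echo "wrote job.sh"
```

The generated file would then be submitted with "sbatch job.sh".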
> 
> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>> Um, I did a "make all install", so I think PMI support is installed. And the 
>> hello world program works. Would it work if PMI support weren't installed?
>> 
>> 2016-04-30 18:04 GMT+01:00 Andy Riebs <[email protected]>:
>> For Slurm, after the "make install", did you do a "make install-contrib" 
>> (which builds the pmi2 support)? I think you would have seen a runtime error 
>> if you hadn't, but possibly not.
>> 
>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>> First of all, thank you for the reaction.
>>> 
>>> Here are the answers:
>>> I tried multiple commands:
>>> I started with "srun -N2 --mpi=pmi2 ptest", then I set the MPI parameter in 
>>> slurm.conf to pmi2 so that I no longer need the option.
>>> I also tried a script submitted via sbatch. It doesn't work either: squeue 
>>> shows the job as still running, but my program just passes a number from 
>>> node 1 to node 2, so it shouldn't take that long.
>>> OpenMPI version is 1.10.2 / SLURM's is 15.08.8
>>> I built Slurm myself with no specific options. For OpenMPI, I installed it 
>>> from the CentOS 7 default repo. I had previously tried building the same 
>>> version with the --with-slurm and --with-pmi options, but that didn't work 
>>> either.
>>> I am attaching a copy of my slurm.conf file and the script I used to submit 
>>> the job.
>>> 
>>> The script:
>>> 
>>> #!/bin/bash
>>> #
>>> #SBATCH --job-name=test
>>> #SBATCH --output=res_mpi.txt
>>> #
>>> #SBATCH -N 2
>>> module load openmpi
>>> mpirun test
>>> 
>>> slurm.conf file:
>>> 
>>> 
>>> # slurm.conf file generated by configurator easy.html.
>>> # Put this file on all nodes of your cluster.
>>> # See the slurm.conf man page for more information.
>>> #
>>> ControlMachine=m
>>> ControlAddr=m
>>> BackupController=mb
>>> BackupAddr=mb
>>> #
>>> #MailProg=/bin/mail
>>> MpiDefault=pmi2
>>> MpiParams=ports=12000-12999
>>> ProctrackType=proctrack/linuxproc
>>> ReturnToService=2
>>> #SlurmctldPidFile=/var/run/slurmctld.pid
>>> #SlurmctldPort=6817
>>> #SlurmdPidFile=/var/run/slurmd.pid
>>> #SlurmdPort=6818
>>> SlurmdSpoolDir=/var/spool/slurm/slurmd
>>> SlurmUser=slurm
>>> #SlurmdUser=root
>>> #StateSaveLocation=/var/spool/slurm
>>> StateSaveLocation=/mnt/data/spool/slurm
>>> SwitchType=switch/none
>>> TaskPlugin=task/none
>>> #
>>> #
>>> # TIMERS
>>> #KillWait=30
>>> #MinJobAge=300
>>> #SlurmctldTimeout=120
>>> #SlurmdTimeout=300
>>> #
>>> #
>>> # SCHEDULING
>>> FastSchedule=1
>>> SchedulerType=sched/backfill
>>> #SchedulerPort=7321
>>> SelectType=select/linear
>>> PreemptType=preempt/partition_prio
>>> PreemptMode=requeue
>>> #
>>> #
>>> # LOGGING AND ACCOUNTING
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> #JobAcctGatherFrequency=30
>>> JobAcctGatherType=jobacct_gather/linux
>>> JobCompType=jobcomp/none
>>> #SlurmctldDebug=3
>>> #SlurmctldLogFile=/var/log/slurmctld.log
>>> SlurmctldLogFile=/mnt/data/log/slurmctld.log
>>> #SlurmdDebug=3
>>> SlurmdLogFile=/var/log/slurmd.log
>>> AccountingStorageBackupHost=mb
>>> #
>>> #
>>> # COMPUTE NODES
>>> NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
>>> NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
>>> 
>>> 2016-04-30 16:40 GMT+01:00 Andy Riebs <[email protected]>:
>>> Hi,
>>> 
>>> The one problem that I see in your description is minor, and probably not 
>>> significant: the MPI ports parameter was needed for very old versions of 
>>> Open MPI, IIRC.
>>> 
>>> To help debug your problems, please respond to this list with:
>>> - What command did you use to invoke your program?
>>> - What versions of Slurm and OpenMPI are you using?
>>> - Did you build them yourself, or use prebuilt versions?
>>> - If you built them yourself, what configuration options did you use?
>>> - If pre-built versions, where did you get them?
>>> - A copy of your slurm.conf file (you may want to change node names and other 
>>>   potentially sensitive information)
>>> 
>>> Andy
>>> 
>>> On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
>>>> Hello everyone,
>>>> 
>>>> I've set up a basic configuration using slurm with a master node, a backup 
>>>> node, a login node and eight compute nodes.
>>>> 
>>>> Everything in slurm is working fine. I can submit jobs and see the state of 
>>>> the eight nodes as Idle. The problem is with OpenMPI. The parallel hello 
>>>> program, where each process prints its rank among the global set, works, 
>>>> but when I try to establish communication between nodes through MPI_Send 
>>>> and MPI_Recv, it just hangs indefinitely.
>>>> 
>>>> I'm using CentOS 7; firewalld and SELinux are disabled. If I launch my 
>>>> parallel program, ptest, on 2 nodes [n1, n2], a quick check with lsof -i 
>>>> shows that ptest is listening on port 1024 on both nodes, which I find 
>>>> weird since only one should be listening. Moreover, I've set Slurm's MPI 
>>>> parameters to pmi2 and the allowed ports to [12000-12999], so why is it 
>>>> still using port 1024?
>>>> 
>>>> I hope you can help me with this problem. I can't see what's wrong.
>>>> Thank you in advance.
>>>> 
>>>> M. Acheli.
>>> 
>>> 
>> 
>> 
> 
> 
