Your slurm-OMPI integration is clearly broken - the processes do not realize 
they are operating in a common world. Does it work if you use mpirun instead of 
srun?
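
One way to narrow this down, as a rough sketch only: have srun launch a plain shell script that prints what Slurm itself told each task. SLURM_PROCID and SLURM_NTASKS are environment variables that srun sets for every task; report_task is just a name made up for this example. If the two tasks report "0 of 2" and "1 of 2" here while your MPI program still sees a world size of 1, that would suggest the launch itself is fine and the problem is in how Open MPI picks up its rank information from Slurm.

```shell
#!/bin/bash
# Print how a single launched task sees the job.
# report_task is a hypothetical helper name, not part of Slurm or Open MPI.
# SLURM_PROCID / SLURM_NTASKS are set by srun; outside of srun they are
# normally unset, in which case "unset" is printed instead.
report_task() {
    echo "host=$(uname -n) task=${SLURM_PROCID:-unset} of ${SLURM_NTASKS:-unset}"
}

report_task
```

Saved as, say, report.sh (a hypothetical file name) and made executable, it could be run with "srun -N2 ./report.sh" and the two output lines compared against what the MPI program reports.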


> On Apr 30, 2016, at 11:28 AM, Mehdi Acheli <[email protected] 
> <mailto:[email protected]>> wrote:
> 
> No, I just tested another program and it seems that the world_size is reduced 
> to one even though I launch the job on two nodes. The hello program is doing 
> the same. Well, I am completely lost now.
> <Capture.PNG>
> 
> <Capture.PNG><Capture1.PNG>
> 
> 2016-04-30 19:09 GMT+01:00 Ralph Castain <[email protected] 
> <mailto:[email protected]>>:
> This looks like a bug in your program - you specified an invalid rank when 
> attempting to send.
> 
>> On Apr 30, 2016, at 10:59 AM, Mehdi Acheli <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> I just did. Permit me to include a capture of the script output file: 
>> 
>> <Capture.PNG>
>> 
>> I specify the option "-N 2" in my script, but it looks like the world_size 
>> is composed of only one process and both nodes are trying to execute an 
>> MPI_Send!
>> 
>> 2016-04-30 18:41 GMT+01:00 Andy Riebs <[email protected] 
>> <mailto:[email protected]>>:
>> Aha! I missed it the first time... In your script, replace "mpirun" with 
>> "srun" and the world should be better.
>> 
>> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>>> Uh, I did a "make all install", so I think PMI support is installed. And 
>>> the hello world program is working; would it work if PMI support weren't 
>>> installed?
>>> 
>>> 2016-04-30 18:04 GMT+01:00 Andy Riebs <[email protected] 
>>> <mailto:[email protected]>>:
>>> For Slurm, after the "make install", did you do a "make install-contrib" 
>>> (which builds the pmi2 support)? I think you would have seen a runtime 
>>> error if you hadn't, but possibly not.
>>> 
>>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>>> First of all, thank you for the reaction.
>>>> 
>>>> Here are the answers:
>>>> - I tried multiple commands. I started with "srun -N2 --mpi=pmi2 ptest", 
>>>>   then I changed the mpi parameter in slurm.conf to pmi2 so I no longer 
>>>>   need the option.
>>>> - I also tried a script submitted via sbatch. It doesn't work either, and 
>>>>   squeue shows that it's still running. My program just passes a number 
>>>>   from node 1 to node 2, so it shouldn't normally take that long.
>>>> - OpenMPI version is 1.10.2; Slurm's is 15.08.8.
>>>> - I built Slurm myself with no specific options. For OpenMPI, I actually 
>>>>   downloaded it from the CentOS 7 default repo. But I tried building the 
>>>>   same version before with the --with-slurm and --with-pmi options, yet it 
>>>>   wasn't working either.
>>>> - I am attaching a copy of my slurm.conf file and the script I used to 
>>>>   submit the job.
>>>> 
>>>> The script:
>>>> 
>>>> #!/bin/bash
>>>> #
>>>> #SBATCH --job-name=test
>>>> #SBATCH --output=res_mpi.txt
>>>> #
>>>> #SBATCH -N 2
>>>> module load openmpi
>>>> mpirun test
>>>> 
>>>> slurm.conf file:
>>>> 
>>>> 
>>>> # slurm.conf file generated by configurator easy.html.
>>>> # Put this file on all nodes of your cluster.
>>>> # See the slurm.conf man page for more information.
>>>> #
>>>> ControlMachine=m
>>>> ControlAddr=m
>>>> BackupController=mb
>>>> BackupAddr=mb
>>>> #
>>>> #MailProg=/bin/mail
>>>> MpiDefault=pmi2
>>>> MpiParams=ports=12000-12999
>>>> ProctrackType=proctrack/linuxproc
>>>> ReturnToService=2
>>>> #SlurmctldPidFile=/var/run/slurmctld.pid
>>>> #SlurmctldPort=6817
>>>> #SlurmdPidFile=/var/run/slurmd.pid
>>>> #SlurmdPort=6818
>>>> SlurmdSpoolDir=/var/spool/slurm/slurmd
>>>> SlurmUser=slurm
>>>> #SlurmdUser=root
>>>> #StateSaveLocation=/var/spool/slurm
>>>> StateSaveLocation=/mnt/data/spool/slurm
>>>> SwitchType=switch/none
>>>> TaskPlugin=task/none
>>>> #
>>>> #
>>>> # TIMERS
>>>> #KillWait=30
>>>> #MinJobAge=300
>>>> #SlurmctldTimeout=120
>>>> #SlurmdTimeout=300
>>>> #
>>>> #
>>>> # SCHEDULING
>>>> FastSchedule=1
>>>> SchedulerType=sched/backfill
>>>> #SchedulerPort=7321
>>>> SelectType=select/linear
>>>> PreemptType=preempt/partition_prio
>>>> PreemptMode=requeue
>>>> #
>>>> #
>>>> # LOGGING AND ACCOUNTING
>>>> AccountingStorageType=accounting_storage/slurmdbd
>>>> #JobAcctGatherFrequency=30
>>>> JobAcctGatherType=jobacct_gather/linux
>>>> JobCompType=jobcomp/none
>>>> #SlurmctldDebug=3
>>>> #SlurmctldLogFile=/var/log/slurmctld.log
>>>> SlurmctldLogFile=/mnt/data/log/slurmctld.log
>>>> #SlurmdDebug=3
>>>> SlurmdLogFile=/var/log/slurmd.log
>>>> AccountingStorageBackupHost=mb
>>>> #
>>>> #
>>>> # COMPUTE NODES
>>>> NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
>>>> NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
>>>> 
>>>> 2016-04-30 16:40 GMT+01:00 Andy Riebs <[email protected] 
>>>> <mailto:[email protected]>>:
>>>> Hi,
>>>> 
>>>> The one problem that I see in your description is minor, and probably not 
>>>> significant: the MPI ports parameter was needed for very old versions of 
>>>> Open MPI, IIRC.
>>>> 
>>>> To help debug your problems, please respond to this list with:
>>>> - What command did you use to invoke your program?
>>>> - What versions of Slurm and OpenMPI are you using?
>>>> - Did you build them yourself, or use prebuilt versions?
>>>> - If you built them yourself, what configuration options did you use?
>>>> - If pre-built versions, where did you get them?
>>>> - A copy of your slurm.conf file (you may want to change node names and 
>>>>   other potentially sensitive information)
>>>> 
>>>> Andy
>>>> 
>>>> On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
>>>>> Hello everyone,
>>>>> 
>>>>> I've set up a basic configuration using Slurm with a master node, a 
>>>>> backup node, a login node and eight compute nodes.
>>>>> 
>>>>> Everything in Slurm is working fine. I can submit jobs and see the 
>>>>> state of the eight nodes as Idle. The problem is with OpenMPI. The 
>>>>> parallel hello program, where each process prints its rank among the 
>>>>> global set, works, but when I try to establish communication between 
>>>>> nodes through MPI_Send and MPI_Recv, it just hangs there indefinitely.
>>>>> 
>>>>> I'm using CentOS 7; firewalld and SELinux are disabled. If I launch my 
>>>>> parallel program, ptest, on 2 nodes, [n1, n2], a quick check with lsof 
>>>>> -i shows that ptest is listening on port 1024 on both nodes, which I find 
>>>>> weird since only one should be listening. Moreover, I've set Slurm's MPI 
>>>>> parameter to pmi2 and the allowed ports to [12000-12999], so why is it 
>>>>> still using port 1024?
>>>>> 
>>>>> I hope you can help me with this problem. I can't see what's wrong.
>>>>> Thank you in advance.
>>>>> 
>>>>> M. Acheli.
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
