No, I just tested another program, and it seems that the world_size is
reduced to one even though I launch the job on two nodes. The hello program
does the same. Well, I am completely lost now.
[image: Inline image 1]

[image: Inline image 2][image: Inline image 3]
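
One way to narrow down a world_size of one is to compare it with what Slurm itself launched. The following is only a minimal sketch, not something taken from this thread: the function name describe_task is hypothetical, while SLURM_PROCID, SLURM_NTASKS and SLURMD_NODENAME are environment variables that Slurm sets for tasks it starts.

```shell
#!/bin/bash
# Hypothetical helper (not from the thread): print how Slurm sees this task.
# SLURM_PROCID    : rank of the task within the job step (set by srun)
# SLURM_NTASKS    : number of tasks in the job step
# SLURMD_NODENAME : name of the node running the task
# A "?" or "unknown" is printed when the variable is not set, for example
# when the function is called outside of Slurm.
describe_task() {
    local rank="${SLURM_PROCID:-?}"
    local ntasks="${SLURM_NTASKS:-?}"
    local node="${SLURMD_NODENAME:-unknown}"
    echo "task ${rank} of ${ntasks} on ${node}"
}
```

If this is saved in a script that ends with a call to describe_task and is launched with srun on two nodes, and it reports two tasks while the MPI program still sees a world_size of one, the likely suspect is the connection between the MPI library and Slurm's PMI support rather than the allocation itself.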

2016-04-30 19:09 GMT+01:00 Ralph Castain <[email protected]>:

> This looks like a bug in your program - you specified an invalid rank when
> attempting to send.
>
> On Apr 30, 2016, at 10:59 AM, Mehdi Acheli <[email protected]> wrote:
>
> I just did. Permit me to include a capture of the script output file:
>
> <Capture.PNG>
>
> I specify the option "-N 2" in my script, but it looks like the world_size
> is composed of only one process and both nodes are trying to execute an
> MPI_Send!
>
> 2016-04-30 18:41 GMT+01:00 Andy Riebs <[email protected]>:
>
>> Aha! I missed it the first time... In your script, replace "mpirun" with
>> "srun" and the world should be better.
>>
>> On 04/30/2016 01:35 PM, Mehdi Acheli wrote:
>>
>> Uh, I did a "make all install", so I think PMI support is installed. And
>> the hello world program is working; would it work if PMI weren't installed?
>>
>> 2016-04-30 18:04 GMT+01:00 Andy Riebs <[email protected]>:
>>
>>> For Slurm, after the "make install", did you do a "make install-contrib"
>>> (which builds the pmi2 support)? I think you would have seen a runtime
>>> error if you hadn't, but possibly not.
>>>
>>> On 04/30/2016 12:14 PM, Mehdi Acheli wrote:
>>>
>>> First of all, thank you for the reaction.
>>>
>>> Here are the answers :
>>>
>>>    1. I tried multiple commands:
>>>       1. I started with "srun -N2 --mpi=pmi2 ptest" then I changed the
>>>       slurm.conf's mpi parameter to pmi2 so I no longer need the option.
>>>       2. I also tried a script submitted via sbatch. It doesn't work
>>>       either and squeue shows that it's running. My program is just
>>>       passing a number from node 1 to node 2 so it doesn't normally take
>>>       that long.
>>>    2. OpenMPI version is 1.10.2 / SLURM's is 15.08.8
>>>    3. I built Slurm myself with no specific options. For OpenMPI I
>>>    actually downloaded it from the CentOS 7 default repo. But I tried
>>>    building the same version before with --with-slurm and --with-pmi
>>>    options, yet it wasn't working either.
>>>
>>> I am including a copy of my slurm.conf file and the script I used to
>>> submit the job.
>>>
>>> The script:
>>>
>>>> #!/bin/bash
>>>> #
>>>> #SBATCH --job-name=test
>>>> #SBATCH --output=res_mpi.txt
>>>> #
>>>> #SBATCH -N 2
>>>> module load openmpi
>>>> mpirun test
>>>
>>>
>>> Slurm.conf file:
>>>
>>>> # slurm.conf file generated by configurator easy.html.
>>>> # Put this file on all nodes of your cluster.
>>>> # See the slurm.conf man page for more information.
>>>> #
>>>> ControlMachine=m
>>>> ControlAddr=m
>>>> BackupController=mb
>>>> BackupAddr=mb
>>>> #
>>>> #MailProg=/bin/mail
>>>> MpiDefault=pmi2
>>>> MpiParams=ports=12000-12999
>>>> ProctrackType=proctrack/linuxproc
>>>> ReturnToService=2
>>>> #SlurmctldPidFile=/var/run/slurmctld.pid
>>>> #SlurmctldPort=6817
>>>> #SlurmdPidFile=/var/run/slurmd.pid
>>>> #SlurmdPort=6818
>>>> SlurmdSpoolDir=/var/spool/slurm/slurmd
>>>> SlurmUser=slurm
>>>> #SlurmdUser=root
>>>> #StateSaveLocation=/var/spool/slurm
>>>> StateSaveLocation=/mnt/data/spool/slurm
>>>> SwitchType=switch/none
>>>> TaskPlugin=task/none
>>>> #
>>>> #
>>>> # TIMERS
>>>> #KillWait=30
>>>> #MinJobAge=300
>>>> #SlurmctldTimeout=120
>>>> #SlurmdTimeout=300
>>>> #
>>>> #
>>>> # SCHEDULING
>>>> FastSchedule=1
>>>> SchedulerType=sched/backfill
>>>> #SchedulerPort=7321
>>>> SelectType=select/linear
>>>> PreemptType=preempt/partition_prio
>>>> PreemptMode=requeue
>>>> #
>>>> #
>>>> # LOGGING AND ACCOUNTING
>>>> AccountingStorageType=accounting_storage/slurmdbd
>>>> #JobAcctGatherFrequency=30
>>>> JobAcctGatherType=jobacct_gather/linux
>>>> JobCompType=jobcomp/none
>>>> #SlurmctldDebug=3
>>>> #SlurmctldLogFile=/var/log/slurmctld.log
>>>> SlurmctldLogFile=/mnt/data/log/slurmctld.log
>>>> #SlurmdDebug=3
>>>> SlurmdLogFile=/var/log/slurmd.log
>>>> AccountingStorageBackupHost=mb
>>>> #
>>>> #
>>>> # COMPUTE NODES
>>>> NodeName=n[1-8] NodeAddr=n[1-8] CPUs=1 State=UNKNOWN
>>>> NodeName=logn NodeAddr=logn CPUs=1 State=UNKNOWN
>>>>
>>> 2016-04-30 16:40 GMT+01:00 Andy Riebs <[email protected]>:
>>>
>>>> Hi,
>>>>
>>>> The one problem that I see in your description is minor, and probably
>>>> not significant: the MPI ports parameter was needed for very old versions
>>>> of Open MPI, IIRC.
>>>>
>>>> To help debug your problems, please respond to this list with
>>>>
>>>>    1. What command did you use to invoke your program?
>>>>    2. What versions of Slurm and OpenMPI are you using?
>>>>    3. Did you build them yourself, or use prebuilt versions?
>>>>       - If you built them yourself, what configuration options did you
>>>>       use?
>>>>       - If pre-built versions, where did you get them?
>>>>    4. A copy of your slurm.conf file (you may want to change node
>>>>    names and other potentially sensitive information)
>>>>
>>>> Andy
>>>>
>>>> On 04/30/2016 10:02 AM, Mehdi Acheli wrote:
>>>>
>>>> Hello everyone,
>>>>
>>>> I've set up a basic configuration using Slurm with a master node, a
>>>> backup node, a login node and eight compute nodes.
>>>>
>>>> Everything in Slurm is working fine. I can submit jobs and see the
>>>> state of the eight nodes as Idle. The problem is with OpenMPI. The
>>>> parallel hello program, where each process prints its rank among the
>>>> global set, is working, but when I try to establish communications
>>>> between nodes through MPI_Send and MPI_Recv, it just hangs there
>>>> indefinitely.
>>>>
>>>> I'm using CentOS 7; firewalld and SELinux are disabled. If I launch my
>>>> parallel program, ptest, on 2 nodes [n1, n2], a quick check with lsof -i
>>>> shows that ptest is listening on port 1024 on both nodes, which I find
>>>> weird since only one should be listening. Moreover, I've set the Slurm
>>>> MPI parameters to pmi2 and the allowed ports to [12000-12999], so why is
>>>> it still using port 1024?
>>>>
>>>> I hope you can help me with this problem. I can't see what's wrong.
>>>> Thank you in advance.
>>>>
>>>> M. Acheli.
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

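For reference, the suggestion made earlier in this thread (replace "mpirun" with "srun" in the batch script) could look roughly like the sketch below. It is an illustration under assumptions, not the poster's final script: the function name make_batch_script and the program path ./ptest are hypothetical, and "--mpi=pmi2" only helps if the installed Open MPI was built with PMI support.

```shell
#!/bin/bash
# Hypothetical generator (not from the thread): print a batch script that
# launches the MPI program with srun instead of mpirun.
# The first argument is the program to run; "./ptest" is an assumed default.
make_batch_script() {
    local program="${1:-./ptest}"
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=res_mpi.txt
#SBATCH -N 2
module load openmpi
srun --mpi=pmi2 ${program}
EOF
}
```

A possible use is "make_batch_script > job.sh" followed by "sbatch job.sh", where job.sh is just an example file name. Because slurm.conf already sets MpiDefault=pmi2, the explicit "--mpi=pmi2" is redundant here and is kept only to make the choice visible.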