Thanks.

$ cat hpl-8cpu-test.sge
#!/bin/bash
#
#$ -N HPL_8cpu_GB
#$ -pe orte 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#
/opt/openmpi-gcc/bin/mpirun --display-allocation --display-map -v -np $NSLOTS --host node0001,node0002 hostname
$ cat HPL_8cpu_GB.o46

======================   ALLOCATED NODES   ======================
 Data for node: Name: node0001                Num slots: 4  Max slots: 0
 Data for node: Name: node0002.v5cluster.com  Num slots: 4  Max slots: 0
=================================================================

 ========================   JOB MAP   ========================

 Data for node: Name: node0001  Num procs: 8
        Process OMPI jobid: [10982,1] Process rank: 0
        Process OMPI jobid: [10982,1] Process rank: 1
        Process OMPI jobid: [10982,1] Process rank: 2
        Process OMPI jobid: [10982,1] Process rank: 3
        Process OMPI jobid: [10982,1] Process rank: 4
        Process OMPI jobid: [10982,1] Process rank: 5
        Process OMPI jobid: [10982,1] Process rank: 6
        Process OMPI jobid: [10982,1] Process rank: 7
 =============================================================
node0001
node0001
node0001
node0001
node0001
node0001
node0001
node0001

I'm not sure why node0001 is missing the domain name -- is this related?
However, the result is correct when I run "qconf -sel":

$ qconf -sel
node0001.v5cluster.com
node0002.v5cluster.com

2009/4/1 Ralph Castain <r...@lanl.gov>

> Rolf has correctly reminded me that display-allocation occurs prior to host
> filtering, so you will see all of the allocated nodes. You'll see the impact
> of the host specifications in display-map.
>
> Sorry for the confusion - thanks to Rolf for pointing it out.
> Ralph
>
>
> On Apr 1, 2009, at 7:40 AM, Ralph Castain wrote:
>
>> As an FYI: you can debug allocation issues more easily by:
>>
>> mpirun --display-allocation --do-not-launch -n 1 foo
>>
>> This will read the allocation, do whatever host filtering you specify with
>> -host and -hostfile options, report out the result, and then terminate
>> without trying to launch anything. I found it most useful for debugging
>> these situations.
>>
>> If you want to know where the procs would have gone, then you can do:
>>
>> mpirun --display-allocation --display-map --do-not-launch -n 8 foo
>>
>> In this case, the #procs you specify needs to be the number you actually
>> wanted so that the mapper will run properly. However, the executable can be
>> bogus and nothing will actually launch. It's the closest you can come to a
>> dry run of a job.
>>
>> HTH
>> Ralph
>>
>>
>> On Apr 1, 2009, at 7:10 AM, Rolf Vandevaart wrote:
>>
>>> It turns out that the use of --host and --hostfile acts as a filter of
>>> which nodes to run on when you are running under SGE. So, listing them
>>> several times does not affect where the processes land. However, this
>>> still does not explain why you are seeing what you are seeing. One thing
>>> you can try is to add this to the mpirun command:
>>>
>>> -mca ras_gridengine_verbose 100
>>>
>>> This will provide some additional information as to what Open MPI is
>>> seeing as nodes and slots from SGE. (Is there any chance that node0002
>>> actually has 8 slots?)
>>>
>>> I just retried on my cluster of 2-CPU SPARC Solaris nodes. When I run
>>> with np=2, the two MPI processes all land on a single node, because that
>>> node has two slots. When I go up to np=4, they move on to the other node.
>>> The --host acts as a filter to where they should run.
>>>
>>> In terms of using "IB bonding", I do not know what that means exactly.
>>> Open MPI does stripe over multiple IB interfaces, so I think the answer
>>> is yes.
>>>
>>> Rolf
>>>
>>> PS: Here is what my np=4 job script looked like. (I just changed np=2
>>> for the other run.)
>>>
>>> burl-ct-280r-0 148 => more run.sh
>>> #!/bin/bash
>>> #$ -S /bin/bash
>>> #$ -V
>>> #$ -cwd
>>> #$ -N Job1
>>> #$ -pe orte 200
>>> #$ -j y
>>> #$ -l h_rt=00:20:00   # Run time (hh:mm:ss) - 10 min
>>>
>>> echo $NSLOTS
>>> /opt/SUNWhpc/HPC8.2/sun/bin/mpirun -mca ras_gridengine_verbose 100 -v -np 4 -host burl-ct-280r-1,burl-ct-280r-0 -mca btl self,sm,tcp hostname
>>>
>>> Here is the output (somewhat truncated):
>>>
>>> burl-ct-280r-0 150 => more Job1.o199
>>> 200
>>> [burl-ct-280r-2:22132] ras:gridengine: JOB_ID: 199
>>> [burl-ct-280r-2:22132] ras:gridengine: PE_HOSTFILE: /ws/ompi-tools/orte/sge/sge6_2u1/default/spool/burl-ct-280r-2/active_jobs/199.1/pe_hostfile
>>> [..snip..]
>>> [burl-ct-280r-2:22132] ras:gridengine: burl-ct-280r-0: PE_HOSTFILE shows slots=2
>>> [burl-ct-280r-2:22132] ras:gridengine: burl-ct-280r-1: PE_HOSTFILE shows slots=2
>>> [..snip..]
>>> burl-ct-280r-1
>>> burl-ct-280r-1
>>> burl-ct-280r-0
>>> burl-ct-280r-0
>>> burl-ct-280r-0 151 =>
>>>
>>>
>>> On 03/31/09 22:39, PN wrote:
>>>
>>>> Dear Rolf,
>>>> Thanks for your reply.
>>>> I've created another PE and changed the submission script, explicitly
>>>> specifying the hostname with "--host". However, the result is the same.
>>>>
>>>> # qconf -sp orte
>>>> pe_name            orte
>>>> slots              8
>>>> user_lists         NONE
>>>> xuser_lists        NONE
>>>> start_proc_args    /bin/true
>>>> stop_proc_args     /bin/true
>>>> allocation_rule    $fill_up
>>>> control_slaves     TRUE
>>>> job_is_first_task  FALSE
>>>> urgency_slots      min
>>>> accounting_summary TRUE
>>>>
>>>> $ cat hpl-8cpu-test.sge
>>>> #!/bin/bash
>>>> #
>>>> #$ -N HPL_8cpu_GB
>>>> #$ -pe orte 8
>>>> #$ -cwd
>>>> #$ -j y
>>>> #$ -S /bin/bash
>>>> #$ -V
>>>> #
>>>> cd /home/admin/hpl-2.0
>>>> /opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS --host node0001,node0001,node0001,node0001,node0002,node0002,node0002,node0002 ./bin/goto-openmpi-gcc/xhpl
>>>>
>>>> # pdsh -a ps ax --width=200 | grep hpl
>>>> node0002: 18901 ?  S    0:00 /opt/openmpi-gcc/bin/mpirun -v -np 8 --host node0001,node0001,node0001,node0001,node0002,node0002,node0002,node0002 ./bin/goto-openmpi-gcc/xhpl
>>>> node0002: 18902 ?  RLl  0:29 ./bin/goto-openmpi-gcc/xhpl
>>>> node0002: 18903 ?  RLl  0:29 ./bin/goto-openmpi-gcc/xhpl
>>>> node0002: 18904 ?  RLl  0:28 ./bin/goto-openmpi-gcc/xhpl
>>>> node0002: 18905 ?  RLl  0:28 ./bin/goto-openmpi-gcc/xhpl
>>>> node0002: 18906 ?  RLl  0:29 ./bin/goto-openmpi-gcc/xhpl
>>>> node0002: 18907 ?  RLl  0:28 ./bin/goto-openmpi-gcc/xhpl
>>>> node0002: 18908 ?  RLl  0:28 ./bin/goto-openmpi-gcc/xhpl
>>>> node0002: 18909 ?  RLl  0:28 ./bin/goto-openmpi-gcc/xhpl
>>>>
>>>> Any hint to debug this situation?
>>>> Also, if I have 2 IB ports in each node, with IB bonding done, will
>>>> Open MPI automatically benefit from the double bandwidth?
>>>> Thanks a lot.
>>>> Best Regards,
>>>> PN
>>>>
>>>> 2009/4/1 Rolf Vandevaart <rolf.vandeva...@sun.com <mailto:rolf.vandeva...@sun.com>>
>>>>
>>>>     On 03/31/09 11:43, PN wrote:
>>>>
>>>>         Dear all,
>>>>         I'm using Open MPI 1.3.1 and SGE 6.2u2 on CentOS 5.2.
>>>>         I have 2 compute nodes for testing; each node has a single
>>>>         quad-core CPU.
>>>>         Here is my submission script and PE config:
>>>>
>>>>         $ cat hpl-8cpu.sge
>>>>         #!/bin/bash
>>>>         #
>>>>         #$ -N HPL_8cpu_IB
>>>>         #$ -pe mpi-fu 8
>>>>         #$ -cwd
>>>>         #$ -j y
>>>>         #$ -S /bin/bash
>>>>         #$ -V
>>>>         #
>>>>         cd /home/admin/hpl-2.0
>>>>         # For IB
>>>>         /opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS -machinefile $TMPDIR/machines ./bin/goto-openmpi-gcc/xhpl
>>>>
>>>>         I've tested that the mpirun command runs correctly on the
>>>>         command line.
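[Editorial sketch: Rolf's ras:gridengine lines above come straight from the SGE PE_HOSTFILE. Assuming the standard four-column SGE pe_hostfile layout (hostname, slots, queue, processor range) and using made-up host names that mirror this thread, the per-slot host expansion that Open MPI derives from that file can be reproduced with a one-line awk filter. This is an illustration, not Open MPI's actual code.]

```shell
#!/bin/sh
# Hypothetical pe_hostfile in SGE's four-column format:
#   hostname  slots  queue  processor-range
# Host names and slot counts echo the thread, not a real job.
pe_hostfile=$(mktemp)
cat > "$pe_hostfile" <<'EOF'
node0001.v5cluster.com 4 all.q@node0001.v5cluster.com UNDEFINED
node0002.v5cluster.com 4 all.q@node0002.v5cluster.com UNDEFINED
EOF

# Print each host once per granted slot: the effective list the
# mapper schedules against, which --host then merely filters.
awk '{ for (i = 0; i < $2; i++) print $1 }' "$pe_hostfile"

rm -f "$pe_hostfile"
```

With np=8 and the $fill_up rule, this expansion fills node0001's four slots before moving on to node0002, which matches Rolf's description of --host acting only as a filter over the SGE allocation.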
>>>>         $ qconf -sp mpi-fu
>>>>         pe_name            mpi-fu
>>>>         slots              8
>>>>         user_lists         NONE
>>>>         xuser_lists        NONE
>>>>         start_proc_args    /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
>>>>         stop_proc_args     /opt/sge/mpi/stopmpi.sh
>>>>         allocation_rule    $fill_up
>>>>         control_slaves     TRUE
>>>>         job_is_first_task  FALSE
>>>>         urgency_slots      min
>>>>         accounting_summary TRUE
>>>>
>>>>         I've checked $TMPDIR/machines after submitting; it was correct:
>>>>         node0002
>>>>         node0002
>>>>         node0002
>>>>         node0002
>>>>         node0001
>>>>         node0001
>>>>         node0001
>>>>         node0001
>>>>
>>>>         However, I found that if I explicitly specify "-machinefile
>>>>         $TMPDIR/machines", all 8 MPI processes are spawned within a
>>>>         single node, i.e. node0002.
>>>>         If I omit "-machinefile $TMPDIR/machines" from the mpirun
>>>>         line, i.e.
>>>>         /opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS ./bin/goto-openmpi-gcc/xhpl
>>>>         the MPI processes start correctly: 4 processes on node0001
>>>>         and 4 processes on node0002.
>>>>         Is this normal behaviour of Open MPI?
>>>>
>>>>     I just tried it both ways and I got the same result both times. The
>>>>     processes are split between the nodes. Perhaps to be extra sure,
>>>>     you can just run hostname? And for what it is worth, as you have
>>>>     seen, you do not need to specify a machines file. Open MPI will use
>>>>     the ones that were allocated by SGE. You can also change your
>>>>     parallel queue to not run any scripts, like this:
>>>>
>>>>     start_proc_args    /bin/true
>>>>     stop_proc_args     /bin/true
>>>>
>>>>         Also, I wondered: if I have an IB interface, for example the
>>>>         hostnames on IB become node0001-clust and node0002-clust, will
>>>>         Open MPI automatically use the IB interface?
>>>>
>>>>     Yes, it should use the IB interface.
>>>>
>>>>         How about if I have 2 IB ports in each node, with IB bonding
>>>>         done, will Open MPI automatically benefit from the double
>>>>         bandwidth?
>>>>         Thanks a lot.
>>>>         Best Regards,
>>>>         PN
>>>
>>> --
>>> =========================
>>> rolf.vandeva...@sun.com
>>> 781-442-3043
>>> =========================
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
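[Editorial sketch: the correctly split $TMPDIR/machines file PN quotes earlier can be sanity-checked without launching anything. The contents below are copied from the thread; the temporary path stands in for $TMPDIR/machines.]

```shell
#!/bin/sh
# Count ranks per host in the machines file SGE generated
# ($fill_up rule, 8 slots over two quad-core nodes).
machines=$(mktemp)
cat > "$machines" <<'EOF'
node0002
node0002
node0002
node0002
node0001
node0001
node0001
node0001
EOF

# Expect 4 slots on each node if the allocation is balanced.
sort "$machines" | uniq -c

rm -f "$machines"
```

If this shows 4 per node but all 8 xhpl processes still land on one host, the problem is in how mpirun interprets the list (as in this thread), not in what SGE allocated.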