Re: [OMPI users] Strange behaviour of SGE+OpenMPI

Ralph Castain Wed, 1 Apr 2009 09:40:48 -0400

As an FYI: you can debug allocation issues more easily by running:


mpirun --display-allocation --do-not-launch -n 1 foo

This will read the allocation, do whatever host filtering you specify with -host and -hostfile options, report out the result, and then terminate without trying to launch anything. I found it most useful for debugging these situations.


If you want to know where the procs would have gone, then you can do:

mpirun --display-allocation --display-map --do-not-launch -n 8 foo

In this case, the #procs you specify needs to be the number you actually wanted so that the mapper will properly run. However, the executable can be bogus and nothing will actually launch. It's the closest you can come to a dry run of a job.
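
The dry-run invocation above can be wrapped in a small helper for use inside a job script. This is only a sketch: the function name ompi_dry_run_cmd and the default executable name foo are illustrative, and it uses only the mpirun options named in this message, as they apply to the Open MPI release discussed in this thread (other releases may differ).

```shell
#!/bin/bash
# Sketch: build the "dry run" mpirun command line described above.
# ompi_dry_run_cmd is a hypothetical helper name, not part of Open MPI.
ompi_dry_run_cmd() {
    local nprocs="${1:-1}"   # number of processes the mapper should place
    local exe="${2:-foo}"    # may be bogus; nothing is actually launched
    echo "mpirun --display-allocation --display-map --do-not-launch -n ${nprocs} ${exe}"
}

# Print the command; paste or eval it inside the SGE job to see the mapping.
ompi_dry_run_cmd 8 foo
```

Printing the command rather than running it keeps the helper harmless on machines where mpirun is not installed.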


HTH
Ralph


On Apr 1, 2009, at 7:10 AM, Rolf Vandevaart wrote:

It turns out that --host and --hostfile act as a filter on which nodes to run on when you are running under SGE. So, listing a node several times does not affect where the processes land. However, this still does not explain why you are seeing what you are seeing. One thing you can try is to add this to the mpirun command:
-mca ras_gridengine_verbose 100
This will provide some additional information as to what Open MPI is seeing as nodes and slots from SGE. (Is there any chance that node0002 actually has 8 slots?)
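
As a complementary check from inside the job, the slots granted by SGE can be totalled per host directly from the PE hostfile. This is a sketch under an assumption: it expects the usual pe_hostfile layout of "host slots queue processor-range" per line, and pe_slots is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: summarise the slots SGE granted on each host.
# Assumes each pe_hostfile line is "<host> <slots> <queue> <processor range>".
# pe_slots is a hypothetical helper name.
pe_slots() {
    # $1 = path to a pe_hostfile
    awk '{ slots[$1] += $2; total += $2 }
         END {
             for (h in slots) printf "%s slots=%d\n", h, slots[h]
             printf "total slots=%d\n", total
         }' "$1"
}

# Inside an SGE parallel job, PE_HOSTFILE points at the granted allocation.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    pe_slots "$PE_HOSTFILE"
fi
```

If a single node reports all of the slots here, the placement is coming from the SGE allocation rather than from mpirun.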
I just retried on my cluster of 2-CPU SPARC Solaris nodes. When I run with np=2, both MPI processes land on a single node, because that node has two slots. When I go up to np=4, the processes spill over onto the other node. The --host option acts as a filter on where they should run.
As for "IB bonding", I do not know exactly what that means. Open MPI does stripe over multiple IB interfaces, so I think the answer is yes.
Rolf
PS: Here is what my np=4 job script looked like. (I just changed it to np=2 for the other run.)
burl-ct-280r-0 148 =>more run.sh
#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe orte 200
#$ -j y
#$ -l h_rt=00:20:00      # Run time (hh:mm:ss) - 20 min

echo $NSLOTS
/opt/SUNWhpc/HPC8.2/sun/bin/mpirun -mca ras_gridengine_verbose 100 -v -np 4 -host burl-ct-280r-1,burl-ct-280r-0 -mca btl self,sm,tcp hostname
Here is the output (somewhat truncated)
burl-ct-280r-0 150 =>more Job1.o199
200
[burl-ct-280r-2:22132] ras:gridengine: JOB_ID: 199
[burl-ct-280r-2:22132] ras:gridengine: PE_HOSTFILE: /ws/ompi-tools/orte/sge/sge6_2u1/default/spool/burl-ct-280r-2/active_jobs/199.1/pe_hostfile
[..snip..]
[burl-ct-280r-2:22132] ras:gridengine: burl-ct-280r-0: PE_HOSTFILE shows slots=2
[burl-ct-280r-2:22132] ras:gridengine: burl-ct-280r-1: PE_HOSTFILE shows slots=2
[..snip..]
burl-ct-280r-1
burl-ct-280r-1
burl-ct-280r-0
burl-ct-280r-0
burl-ct-280r-0 151 =>
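
Hostname output like the listing above can be tallied per node, which makes a lopsided placement obvious at a glance. A minimal sketch, where count_per_host is a hypothetical helper name:

```shell
#!/bin/bash
# Sketch: count how many lines (one per MPI process) each host printed.
# count_per_host is a hypothetical helper name.
count_per_host() {
    sort | uniq -c | awk '{ printf "%s %d\n", $2, $1 }'
}

# Typical use inside the job script (not run here):
#   mpirun -np "$NSLOTS" hostname | count_per_host
```

For the run shown above, this would report two processes on each of the two hosts.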


On 03/31/09 22:39, PN wrote:
Dear Rolf,
Thanks for your reply.
I've created another PE and changed the submission script to explicitly specify the hostnames with "--host".
However the result is the same.
# qconf -sp orte
pe_name            orte
slots              8
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
$ cat hpl-8cpu-test.sge
#!/bin/bash
#
#$ -N HPL_8cpu_GB
#$ -pe orte 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#
cd /home/admin/hpl-2.0
/opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS --host node0001,node0001,node0001,node0001,node0002,node0002,node0002,node0002 ./bin/goto-openmpi-gcc/xhpl
# pdsh -a ps ax --width=200|grep hpl
node0002: 18901 ?        S      0:00 /opt/openmpi-gcc/bin/mpirun -v -np 8 --host node0001,node0001,node0001,node0001,node0002,node0002,node0002,node0002 ./bin/goto-openmpi-gcc/xhpl
node0002: 18902 ?        RLl    0:29 ./bin/goto-openmpi-gcc/xhpl
node0002: 18903 ?        RLl    0:29 ./bin/goto-openmpi-gcc/xhpl
node0002: 18904 ?        RLl    0:28 ./bin/goto-openmpi-gcc/xhpl
node0002: 18905 ?        RLl    0:28 ./bin/goto-openmpi-gcc/xhpl
node0002: 18906 ?        RLl    0:29 ./bin/goto-openmpi-gcc/xhpl
node0002: 18907 ?        RLl    0:28 ./bin/goto-openmpi-gcc/xhpl
node0002: 18908 ?        RLl    0:28 ./bin/goto-openmpi-gcc/xhpl
node0002: 18909 ?        RLl    0:28 ./bin/goto-openmpi-gcc/xhpl
Any hints on how to debug this situation?
Also, if I have 2 IB ports in each node, with IB bonding configured, will Open MPI automatically benefit from the double bandwidth?
Thanks a lot.
Best Regards,
PN
2009/4/1 Rolf Vandevaart <rolf.vandeva...@sun.com>
   On 03/31/09 11:43, PN wrote:
       Dear all,
       I'm using Open MPI 1.3.1 and SGE 6.2u2 on CentOS 5.2
       I have 2 compute nodes for testing; each node has a single
       quad-core CPU.
       Here is my submission script and PE config:
       $ cat hpl-8cpu.sge
       #!/bin/bash
       #
       #$ -N HPL_8cpu_IB
       #$ -pe mpi-fu 8
       #$ -cwd
       #$ -j y
       #$ -S /bin/bash
       #$ -V
       #
       cd /home/admin/hpl-2.0
       # For IB
       /opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS -machinefile
       $TMPDIR/machines ./bin/goto-openmpi-gcc/xhpl
       I've verified that the mpirun command runs correctly from the
       command line.
       $ qconf -sp mpi-fu
       pe_name            mpi-fu
       slots              8
       user_lists         NONE
       xuser_lists        NONE
       start_proc_args    /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
       stop_proc_args     /opt/sge/mpi/stopmpi.sh
       allocation_rule    $fill_up
       control_slaves     TRUE
       job_is_first_task  FALSE
       urgency_slots      min
       accounting_summary TRUE
       I've checked $TMPDIR/machines after submitting, and it was correct.
       node0002
       node0002
       node0002
       node0002
       node0001
       node0001
       node0001
       node0001
       However, I found that if I explicitly specify "-machinefile
       $TMPDIR/machines", all 8 MPI processes are spawned on a
       single node, i.e. node0002.
       However, if I omit "-machinefile $TMPDIR/machines" from the
       mpirun line, i.e.
       /opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS
       ./bin/goto-openmpi-gcc/xhpl
       the MPI processes start correctly, 4 processes on node0001
       and 4 processes on node0002.
       Is this normal behaviour of Open MPI?
   I just tried it both ways and I got the same result both times.  The
   processes are split between the nodes.  Perhaps to be extra sure,
   you can just run hostname?  And for what it is worth, as you have
   seen, you do not need to specify a machines file.  Open MPI will use
   the ones that were allocated by SGE.  You can also change your
   parallel queue to not run any scripts.  Like this:
   start_proc_args    /bin/true
   stop_proc_args     /bin/true
       Also, if I have an IB interface, for example with the IB
       hostnames node0001-clust and node0002-clust, will
       Open MPI automatically use the IB interface?
   Yes, it should use the IB interface.
       And if I have 2 IB ports in each node, with IB bonding
       configured, will Open MPI automatically benefit from the double
       bandwidth?
       Thanks a lot.
       Best Regards,
       PN
------------------------------------------------------------------------
       _______________________________________________
       users mailing list
       us...@open-mpi.org
       http://www.open-mpi.org/mailman/listinfo.cgi/users
   --     =========================
   rolf.vandeva...@sun.com
   781-442-3043
   =========================
