Thanks. I've tried your suggestion.

$ cat hpl-8cpu-test.sge
#!/bin/bash
#
#$ -N HPL_8cpu_GB
#$ -pe orte 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#
/opt/openmpi-gcc/bin/mpirun -mca ras_gridengine_verbose 100 -v -np $NSLOTS --host node0001,node0002 hostname
It allocated 2 nodes to the job, however all the processes were spawned on node0001.

$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch       states
---------------------------------------------------------------------------------
al...@node0001.v5cluster.com   BIPC  0/4/4          4.79     lx24-amd64
     45 0.55500 HPL_8cpu_G admin    r     04/02/2009 00:26:49     4
---------------------------------------------------------------------------------
al...@node0002.v5cluster.com   BIPC  0/4/4          0.00     lx24-amd64
     45 0.55500 HPL_8cpu_G admin    r     04/02/2009 00:26:49     4

$ cat HPL_8cpu_GB.o45
[node0001:03194] ras:gridengine: JOB_ID: 45
[node0001:03194] ras:gridengine: node0001.v5cluster.com: PE_HOSTFILE shows slots=4
[node0001:03194] ras:gridengine: node0002.v5cluster.com: PE_HOSTFILE shows slots=4
node0001
node0001
node0001
node0001
node0001
node0001
node0001
node0001

$ qconf -sq all.q
qname                 all.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:01:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             blcr
pe_list               make mpi-rr mpi-fu orte
rerun                 FALSE
slots                 4,[node0001=4],[node0002=4]
tmpdir                /tmp
shell                 /bin/sh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

$ qconf -se node0001
hostname              node0001.v5cluster.com
load_scaling          NONE
complex_values        slots=4
load_values           arch=lx24-amd64,num_proc=4,mem_total=3949.597656M, \
                      swap_total=0.000000M,virtual_total=3949.597656M, \
                      load_avg=2.800000,load_short=0.220000, \
                      load_medium=2.800000,load_long=2.320000, \
                      mem_free=3818.746094M,swap_free=0.000000M, \
                      virtual_free=3818.746094M,mem_used=130.851562M, \
                      swap_used=0.000000M,virtual_used=130.851562M, \
                      cpu=0.000000,np_load_avg=0.700000, \
                      np_load_short=0.055000,np_load_medium=0.700000, \
                      np_load_long=0.580000
processors            4
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE

$ qconf -se node0002
hostname              node0002.v5cluster.com
load_scaling          NONE
complex_values        slots=4
load_values           arch=lx24-amd64,num_proc=4,mem_total=3949.597656M, \
                      swap_total=0.000000M,virtual_total=3949.597656M, \
                      load_avg=0.000000,load_short=0.000000, \
                      load_medium=0.000000,load_long=0.000000, \
                      mem_free=3843.074219M,swap_free=0.000000M, \
                      virtual_free=3843.074219M,mem_used=106.523438M, \
                      swap_used=0.000000M,virtual_used=106.523438M, \
                      cpu=0.000000,np_load_avg=0.000000, \
                      np_load_short=0.000000,np_load_medium=0.000000, \
                      np_load_long=0.000000
processors            4
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE

2009/4/1 Rolf Vandevaart <rolf.vandeva...@sun.com>

> It turns out that the use of --host and --hostfile acts as a filter of
> which nodes to run on when you are running under SGE. So, listing them
> several times does not affect where the processes land. However, this
> still does not explain why you are seeing what you are seeing. One thing
> you can try is to add this to the mpirun command.
>
> -mca ras_gridengine_verbose 100
>
> This will provide some additional information as to what Open MPI is
> seeing as nodes and slots from SGE. (Is there any chance that node0002
> actually has 8 slots?)
>
> I just retried on my cluster of 2-CPU sparc solaris nodes. When I run
> with np=2, the two MPI processes both land on a single node, because
> that node has two slots. When I go up to np=4, then they move on to the
> other node. The --host acts as a filter to where they should run.
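One way to double-check what SGE actually handed to Open MPI is to sum the slot counts in the PE_HOSTFILE before calling mpirun. A minimal sketch follows; the sample file contents (including the queue name) are hypothetical and only imitate the PE_HOSTFILE format (hostname, slots, queue, processor range). Inside a real job, $PE_HOSTFILE is already set by SGE:

```shell
#!/bin/bash
# Sketch: verify the slot allocation SGE granted to the job.
# In a real SGE job $PE_HOSTFILE points at the generated hostfile;
# the sample written below is hypothetical and only used as a fallback.
cat > /tmp/pe_hostfile.sample <<'EOF'
node0001.v5cluster.com 4 all.q@node0001.v5cluster.com UNDEFINED
node0002.v5cluster.com 4 all.q@node0002.v5cluster.com UNDEFINED
EOF
PE_HOSTFILE=${PE_HOSTFILE:-/tmp/pe_hostfile.sample}

# Print per-host slots and the total; the total should equal $NSLOTS.
awk '{ total += $2; printf "%s: %d slots\n", $1, $2 }
     END { printf "total: %d\n", total }' "$PE_HOSTFILE"
```

If the total printed here is 8 but all processes still land on node0001, the problem is on the Open MPI side rather than in SGE's allocation.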
>
> In terms of using "IB bonding", I do not know what that means exactly.
> Open MPI does stripe over multiple IB interfaces, so I think the answer
> is yes.
>
> Rolf
>
> PS: Here is what my np=4 job script looked like. (I just changed np=2
> for the other run)
>
> burl-ct-280r-0 148 =>more run.sh
> #! /bin/bash
> #$ -S /bin/bash
> #$ -V
> #$ -cwd
> #$ -N Job1
> #$ -pe orte 200
> #$ -j y
> #$ -l h_rt=00:20:00  # Run time (hh:mm:ss) - 10 min
>
> echo $NSLOTS
> /opt/SUNWhpc/HPC8.2/sun/bin/mpirun -mca ras_gridengine_verbose 100 -v -np 4 -host burl-ct-280r-1,burl-ct-280r-0 -mca btl self,sm,tcp hostname
>
> Here is the output (somewhat truncated):
> burl-ct-280r-0 150 =>more Job1.o199
> 200
> [burl-ct-280r-2:22132] ras:gridengine: JOB_ID: 199
> [burl-ct-280r-2:22132] ras:gridengine: PE_HOSTFILE: /ws/ompi-tools/orte/sge/sge6_2u1/default/spool/burl-ct-280r-2/active_jobs/199.1/pe_hostfile
> [..snip..]
> [burl-ct-280r-2:22132] ras:gridengine: burl-ct-280r-0: PE_HOSTFILE shows slots=2
> [burl-ct-280r-2:22132] ras:gridengine: burl-ct-280r-1: PE_HOSTFILE shows slots=2
> [..snip..]
> burl-ct-280r-1
> burl-ct-280r-1
> burl-ct-280r-0
> burl-ct-280r-0
> burl-ct-280r-0 151 =>
>
> On 03/31/09 22:39, PN wrote:
>
>> Dear Rolf,
>>
>> Thanks for your reply.
>> I've created another PE and changed the submission script, explicitly
>> specifying the hostnames with "--host".
>> However the result is the same.
>>
>> # qconf -sp orte
>> pe_name            orte
>> slots              8
>> user_lists         NONE
>> xuser_lists        NONE
>> start_proc_args    /bin/true
>> stop_proc_args     /bin/true
>> allocation_rule    $fill_up
>> control_slaves     TRUE
>> job_is_first_task  FALSE
>> urgency_slots      min
>> accounting_summary TRUE
>>
>> $ cat hpl-8cpu-test.sge
>> #!/bin/bash
>> #
>> #$ -N HPL_8cpu_GB
>> #$ -pe orte 8
>> #$ -cwd
>> #$ -j y
>> #$ -S /bin/bash
>> #$ -V
>> #
>> cd /home/admin/hpl-2.0
>> /opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS --host node0001,node0001,node0001,node0001,node0002,node0002,node0002,node0002 ./bin/goto-openmpi-gcc/xhpl
>>
>> # pdsh -a ps ax --width=200 | grep hpl
>> node0002: 18901 ?  S   0:00 /opt/openmpi-gcc/bin/mpirun -v -np 8 --host node0001,node0001,node0001,node0001,node0002,node0002,node0002,node0002 ./bin/goto-openmpi-gcc/xhpl
>> node0002: 18902 ?  RLl 0:29 ./bin/goto-openmpi-gcc/xhpl
>> node0002: 18903 ?  RLl 0:29 ./bin/goto-openmpi-gcc/xhpl
>> node0002: 18904 ?  RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>> node0002: 18905 ?  RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>> node0002: 18906 ?  RLl 0:29 ./bin/goto-openmpi-gcc/xhpl
>> node0002: 18907 ?  RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>> node0002: 18908 ?  RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>> node0002: 18909 ?  RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>>
>> Any hint for debugging this situation?
>>
>> Also, if I have 2 IB ports in each node, with IB bonding done, will
>> Open MPI automatically benefit from the double bandwidth?
>>
>> Thanks a lot.
>>
>> Best Regards,
>> PN
>>
>> 2009/4/1 Rolf Vandevaart <rolf.vandeva...@sun.com>
>>
>> On 03/31/09 11:43, PN wrote:
>>
>> Dear all,
>>
>> I'm using Open MPI 1.3.1 and SGE 6.2u2 on CentOS 5.2.
>> I have 2 compute nodes for testing; each node has a single quad-core
>> CPU.
>>
>> Here is my submission script and PE config:
>> $ cat hpl-8cpu.sge
>> #!/bin/bash
>> #
>> #$ -N HPL_8cpu_IB
>> #$ -pe mpi-fu 8
>> #$ -cwd
>> #$ -j y
>> #$ -S /bin/bash
>> #$ -V
>> #
>> cd /home/admin/hpl-2.0
>> # For IB
>> /opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS -machinefile $TMPDIR/machines ./bin/goto-openmpi-gcc/xhpl
>>
>> I've tested that the mpirun command runs correctly on the command
>> line.
>>
>> $ qconf -sp mpi-fu
>> pe_name            mpi-fu
>> slots              8
>> user_lists         NONE
>> xuser_lists        NONE
>> start_proc_args    /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
>> stop_proc_args     /opt/sge/mpi/stopmpi.sh
>> allocation_rule    $fill_up
>> control_slaves     TRUE
>> job_is_first_task  FALSE
>> urgency_slots      min
>> accounting_summary TRUE
>>
>> I've checked $TMPDIR/machines after submission; it was correct:
>> node0002
>> node0002
>> node0002
>> node0002
>> node0001
>> node0001
>> node0001
>> node0001
>>
>> However, I found that if I explicitly specify "-machinefile
>> $TMPDIR/machines", all 8 MPI processes are spawned within a single
>> node, i.e. node0002.
>>
>> However, if I omit "-machinefile $TMPDIR/machines" from the mpirun
>> line, i.e.
>> /opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS ./bin/goto-openmpi-gcc/xhpl
>>
>> the MPI processes start correctly, 4 processes on node0001 and 4
>> processes on node0002.
>>
>> Is this normal behaviour of Open MPI?
>>
>> I just tried it both ways and I got the same result both times. The
>> processes are split between the nodes. Perhaps to be extra sure, you
>> can just run hostname? And for what it is worth, as you have seen, you
>> do not need to specify a machines file. Open MPI will use the ones
>> that were allocated by SGE. You can also change your parallel queue to
>> not run any scripts.
>> Like this:
>>
>> start_proc_args /bin/true
>> stop_proc_args  /bin/true
>>
>> Also, I wondered, if I have an IB interface, for example the IB
>> hostnames become node0001-clust and node0002-clust, will Open MPI
>> automatically use the IB interface?
>>
>> Yes, it should use the IB interface.
>>
>> How about if I have 2 IB ports in each node, with IB bonding done,
>> will Open MPI automatically benefit from the double bandwidth?
>>
>> Thanks a lot.
>>
>> Best Regards,
>> PN
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> =========================
> rolf.vandeva...@sun.com
> 781-442-3043
> =========================
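Since $fill_up packs slots onto one node before moving to the next, and --host only filters the SGE allocation, another way to force an even spread is SGE's $round_robin allocation rule. The queue's pe_list above already includes an mpi-rr PE; whether it is actually configured this way is an assumption, so compare with the real output of qconf -sp mpi-rr. A hypothetical sketch:

```shell
# Hypothetical PE definition (compare with "qconf -sp mpi-rr" on the cluster).
# $round_robin hands out one slot per host in turn, so -np 8 over two
# 4-slot nodes places 4 processes on each node.
$ qconf -sp mpi-rr
pe_name            mpi-rr
slots              8
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
```

With such a PE, the job would be submitted with "#$ -pe mpi-rr 8" and the --host list dropped entirely, letting Open MPI take the layout straight from SGE.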