Hi Reuti,

It seems that the previous tests were wrong.
I realize that your doubts were justified: only one slot was actually busy,
despite all 16 being deployed.

So I changed the job launcher to:

$qsub -N $nameofthecase -b y -pe orte 20 -cwd mpiexec -np 20 newave170502_L

Note that (for some reason) it's mandatory to tell both the PE and mpiexec
that there are 20 slots to use.
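
As a side note, a job script would avoid spelling the count twice. A minimal
sketch (assuming the usual $NSLOTS variable that SGE sets for parallel jobs;
the job name is just a placeholder):

#!/bin/sh
#$ -N pmo_case          # placeholder for $nameofthecase
#$ -pe orte 20
#$ -cwd
# SGE fills in $NSLOTS with the number of granted slots, so mpiexec
# and the PE can never disagree about the count.
mpiexec -np $NSLOTS newave170502_L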

With that change, here is the output for a job with 20 slots:

$round_robin:

job with 20 slots
job launched as
$qsub -N $nameofthecase -b y -pe orte 20 -cwd mpiexec -np 20 newave170502_L

$ ps -e f --cols=500
 2390 ?        Sl     0:00 /opt/sge6/bin/linux-x64/sge_execd
 2835 ?        S      0:00  \_ sge_shepherd-1 -bg
 2837 ?        Ss     0:00      \_ mpiexec -np 20 newave170502_L
 2838 ?        S      0:00          \_ /usr/bin/hydra_pmi_proxy
--control-port master:46220 --demux poll --pgid 0 --retries 10 --proxy-id 0
 2840 ?        R      1:18          |   \_ newave170502_L
 2841 ?        S      0:54          |   \_ newave170502_L
 2842 ?        S      1:07          |   \_ newave170502_L
 2843 ?        S      0:52          |   \_ newave170502_L
 2844 ?        S      1:07          |   \_ newave170502_L
 2845 ?        S      1:08          |   \_ newave170502_L
 2846 ?        S      0:00          |   \_ newave170502_L
 2847 ?        S      0:00          |   \_ newave170502_L
 2848 ?        S      0:00          |   \_ newave170502_L
 2849 ?        S      0:00          |   \_ newave170502_L
 2839 ?        Sl     0:00          \_ /opt/sge6/bin/linux-x64/qrsh
-inherit -V node001 "/usr/bin/hydra_pmi_proxy" --control-port master:46220
--demux poll --pgid 0 --retries 10 --proxy-id 1


$ mpiexec --version
 HYDRA build details:
    Version:                                 1.4
    Release Date:                            Thu Jun 16 16:41:08 CDT 2011
    CC:                              gcc
 -I/build/buildd/mpich2-1.4/src/mpl/include
-I/build/buildd/mpich2-1.4/src/mpl/include
-I/build/buildd/mpich2-1.4/src/openpa/src
-I/build/buildd/mpich2-1.4/src/openpa/src
-I/build/buildd/mpich2-1.4/src/mpid/ch3/include
-I/build/buildd/mpich2-1.4/src/mpid/ch3/include
-I/build/buildd/mpich2-1.4/src/mpid/common/datatype
-I/build/buildd/mpich2-1.4/src/mpid/common/datatype
-I/build/buildd/mpich2-1.4/src/mpid/common/locks
-I/build/buildd/mpich2-1.4/src/mpid/common/locks
-I/build/buildd/mpich2-1.4/src/mpid/ch3/channels/nemesis/include
-I/build/buildd/mpich2-1.4/src/mpid/ch3/channels/nemesis/include
-I/build/buildd/mpich2-1.4/src/mpid/ch3/channels/nemesis/nemesis/include
-I/build/buildd/mpich2-1.4/src/mpid/ch3/channels/nemesis/nemesis/include
-I/build/buildd/mpich2-1.4/src/mpid/ch3/channels/nemesis/nemesis/utils/monitor
-I/build/buildd/mpich2-1.4/src/mpid/ch3/channels/nemesis/nemesis/utils/monitor
-I/build/buildd/mpich2-1.4/src/util/wrappers
-I/build/buildd/mpich2-1.4/src/util/wrappers  -g -O2 -g -O2 -Wall -O2
 -Wl,-Bsymbolic-functions  -lrt -lcr -lpthread
    CXX:
    F77:
    F90:                             gfortran  -Wl,-Bsymbolic-functions
 -lrt -lcr -lpthread
    Configure options:                       '--build=x86_64-linux-gnu'
'--includedir=${prefix}/include' '--mandir=${prefix}/share/man'
'--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var'
'--libexecdir=${prefix}/lib/mpich2' '--srcdir=.'
'--disable-maintainer-mode' '--disable-dependency-tracking'
'--disable-silent-rules' '--enable-shared' '--prefix=/usr' '--enable-fc'
'--disable-rpath' '--sysconfdir=/etc/mpich2'
'--includedir=/usr/include/mpich2' '--docdir=/usr/share/doc/mpich2'
'--with-hwloc-prefix=system' '--enable-checkpointing'
'--with-hydra-ckpointlib=blcr' 'build_alias=x86_64-linux-gnu'
'MPICH2LIB_CFLAGS=-g -O2 -g -O2 -Wall' 'MPICH2LIB_CXXFLAGS=-g -O2 -g -O2
-Wall' 'MPICH2LIB_FFLAGS=-g -O2' 'MPICH2LIB_FCFLAGS='
'LDFLAGS=-Wl,-Bsymbolic-functions ' 'CPPFLAGS=
-I/build/buildd/mpich2-1.4/src/mpl/include
-I/build/buildd/mpich2-1.4/src/mpl/include
-I/build/buildd/mpich2-1.4/src/openpa/src
-I/build/buildd/mpich2-1.4/src/openpa/src
-I/build/buildd/mpich2-1.4/src/mpid/ch3/include
-I/build/buildd/mpich2-1.4/src/mpid/ch3/include
-I/build/buildd/mpich2-1.4/src/mpid/common/datatype
-I/build/buildd/mpich2-1.4/src/mpid/common/datatype
-I/build/buildd/mpich2-1.4/src/mpid/common/locks
-I/build/buildd/mpich2-1.4/src/mpid/common/locks
-I/build/buildd/mpich2-1.4/src/mpid/ch3/channels/nemesis/include
-I/build/buildd/mpich2-1.4/src/mpid/ch3/channels/nemesis/include
-I/build/buildd/mpich2-1.4/src/mpid/ch3/channels/nemesis/nemesis/include
-I/build/buildd/mpich2-1.4/src/mpid/ch3/channels/nemesis/nemesis/include
-I/build/buildd/mpich2-1.4/src/mpid/ch3/channels/nemesis/nemesis/utils/monitor
-I/build/buildd/mpich2-1.4/src/mpid/ch3/channels/nemesis/nemesis/utils/monitor
-I/build/buildd/mpich2-1.4/src/util/wrappers
-I/build/buildd/mpich2-1.4/src/util/wrappers' 'FFLAGS= -g -O2 -O2'
'FC=gfortran' 'CFLAGS= -g -O2 -g -O2 -Wall -O2' 'CXXFLAGS= -g -O2 -g -O2
-Wall -O2' '--disable-option-checking' 'CC=gcc' 'LIBS=-lrt -lcr -lpthread '
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge
none persist
    Binding libraries available:             hwloc plpa
    Resource management kernels available:   none slurm ll lsf sge pbs
    Checkpointing libraries available:       blcr
    Demux engines available:                 poll select

$ ps -eLf
sgeadmin  2837  2835  2837  0    1 19:49 ?        00:00:00 mpiexec -np 20
newave170502_L
sgeadmin  2838  2837  2838  0    1 19:49 ?        00:00:00
/usr/bin/hydra_pmi_proxy --control-port master:46220 --demux poll --pgid 0
--retries 10 --proxy-id 0
sgeadmin  2839  2837  2839  0    3 19:49 ?        00:00:00
/opt/sge6/bin/linux-x64/qrsh -inherit -V node001 "/usr/bin/hydra_pmi_proxy"
--control-port master:46220 --demux poll -
sgeadmin  2839  2837  2850  0    3 19:49 ?        00:00:00
/opt/sge6/bin/linux-x64/qrsh -inherit -V node001 "/usr/bin/hydra_pmi_proxy"
--control-port master:46220 --demux poll -
sgeadmin  2839  2837  2851  0    3 19:49 ?        00:00:00
/opt/sge6/bin/linux-x64/qrsh -inherit -V node001 "/usr/bin/hydra_pmi_proxy"
--control-port master:46220 --demux poll -
sgeadmin  2840  2838  2840 98    1 19:49 ?        00:04:32 newave170502_L
sgeadmin  2841  2838  2841 89    1 19:49 ?        00:04:05 newave170502_L
sgeadmin  2842  2838  2842 93    1 19:49 ?        00:04:18 newave170502_L
sgeadmin  2843  2838  2843 88    1 19:49 ?        00:04:03 newave170502_L
sgeadmin  2844  2838  2844 93    1 19:49 ?        00:04:19 newave170502_L
sgeadmin  2845  2838  2845 94    1 19:49 ?        00:04:20 newave170502_L
sgeadmin  2846  2838  2846 69    1 19:49 ?        00:03:11 newave170502_L
sgeadmin  2847  2838  2847 69    1 19:49 ?        00:03:11 newave170502_L
sgeadmin  2848  2838  2848 69    1 19:49 ?        00:03:11 newave170502_L
sgeadmin  2849  2838  2849 69    1 19:49 ?        00:03:11 newave170502_L
sgeadmin  2858  2491  2858  0    1 19:54 pts/0    00:00:00 ps -eLf

$ cat /etc/hosts
127.0.0.1 ubuntu

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
# Added by cloud-init
127.0.1.1       ip-10-17-48-113.ec2.internal ip-10-17-48-113
10.17.48.113 master
10.17.48.210 node001
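
If it helps, I can also cross-check name resolution on both machines; just a
sketch of what I would run on master and node001:

$ hostname                         # the name this host reports for itself
$ getent hosts master node001      # what each name resolves to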

$ which mpiexec
/usr/bin/mpiexec

$ cat newave.tim (this is an output file of the MPI app, showing that 20 slots
are being used)
Programa Newave
Versao 17.5.2
Caso: PMO JANEIRO - 2011  29/12/2010 CVAR L25 A25 niveis para 31/12 NW
Versao 17.5.x
Data: 27-07-2013
Hora: 19h 49min 28.425sec
Numero de Processadores:   20 (<-- number of processors)

Everything runs fine. The job is split evenly across the two servers,
occupying 10 slots on each one.
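
For completeness, the per-host grant can also be cross-checked while the job
is running; a sketch, nothing exotic:

$ qstat -g t      # one MASTER/SLAVE line per granted slot and host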

Now, if I change the PE to $fill_up and submit the same 20-slot job,
something weird happens.

Let's see:

$fill_up
job with 20 slots
job launched as
$qsub -N $NOMECASO -b y -pe orte 20 -cwd mpiexec -np 20 newave170502_L

$ ps -e f --cols=500
 2390 ?        Sl     0:01 /opt/sge6/bin/linux-x64/sge_execd
 2890 ?        S      0:00  \_ sge_shepherd-2 -bg
 2892 ?        Ss     0:00      \_ mpiexec -np 20 newave170502_L
 2893 ?        S      0:00          \_ /usr/bin/hydra_pmi_proxy
--control-port master:37827 --demux poll --pgid 0 --retries 10 --proxy-id 0
 2895 ?        R      0:31          |   \_ newave170502_L
 2896 ?        R      0:24          |   \_ newave170502_L
 2897 ?        R      0:24          |   \_ newave170502_L
 2898 ?        R      0:24          |   \_ newave170502_L
 2899 ?        R      0:24          |   \_ newave170502_L
 2900 ?        R      0:24          |   \_ newave170502_L
 2901 ?        S      0:00          |   \_ newave170502_L
 2902 ?        S      0:00          |   \_ newave170502_L
 2903 ?        S      0:00          |   \_ newave170502_L
 2904 ?        S      0:00          |   \_ newave170502_L
 2894 ?        Sl     0:00          \_ /opt/sge6/bin/linux-x64/qrsh
-inherit -V node001 "/usr/bin/hydra_pmi_proxy" --control-port master:37827
--demux poll --pgid 0 --retries 10 --proxy-id 1

$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@master                   BIP   0/16/16        8.20     linux-x64
      2 0.55500 pmo_2011-0 sgeadmin     r     07/27/2013 20:01:11    16
---------------------------------------------------------------------------------
all.q@node001                  BIP   0/4/16         8.24     linux-x64
      2 0.55500 pmo_2011-0 sgeadmin     r     07/27/2013 20:01:11     4

*** As you can see, the scheduler filled up the first server and used 4
slots on the second, but MPI ran 10 processes on the first server and 10 on
the other one.
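
To see exactly what SGE hands to the job, I can dump $PE_HOSTFILE from a job
script next time; a sketch under the same PE (the cat line is the only
addition):

#!/bin/sh
#$ -pe orte 20
#$ -cwd
# $PE_HOSTFILE has one line per granted host with its slot count;
# a tightly integrated mpiexec is supposed to follow it.
cat $PE_HOSTFILE
mpiexec -np $NSLOTS newave170502_L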

If I resubmit it, now with 16 slots:

job with 16 slots

$ ps -e f --cols=500
 2932 ?        S      0:00  \_ sge_shepherd-3 -bg
 2934 ?        Ss     0:00      \_ mpiexec -np 16 newave170502_L
 2935 ?        S      0:00          \_ /usr/bin/hydra_pmi_proxy
--control-port master:50693 --demux poll --pgid 0 --retries 10 --proxy-id 0
 2937 ?        S      0:00          |   \_ newave170502_L
 2938 ?        S      0:00          |   \_ newave170502_L
 2939 ?        S      0:00          |   \_ newave170502_L
 2940 ?        S      0:00          |   \_ newave170502_L
 2941 ?        S      0:00          |   \_ newave170502_L
 2942 ?        S      0:00          |   \_ newave170502_L
 2943 ?        S      0:00          |   \_ newave170502_L
 2944 ?        S      0:00          |   \_ newave170502_L
 2936 ?        Z      0:00          \_ [qrsh] <defunct>

$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@master                   BIP   0/16/16        4.39     linux-x64
      3 0.55500 pmo_2011-0 sgeadmin     r     07/27/2013 20:12:26    16
---------------------------------------------------------------------------------
all.q@node001                  BIP   0/0/16         4.67     linux-x64

$ ps -eLf
sgeadmin  2934  2932  2934  0    1 20:12 ?        00:00:00 mpiexec -np 16
newave170502_L
sgeadmin  2935  2934  2935  0    1 20:12 ?        00:00:00
/usr/bin/hydra_pmi_proxy --control-port master:50693 --demux poll --pgid 0
--retries 10 --proxy-id 0
sgeadmin  2936  2934  2936  0    1 20:12 ?        00:00:00 [qrsh] <defunct>
sgeadmin  2937  2935  2937  0    1 20:12 ?        00:00:00 newave170502_L
sgeadmin  2938  2935  2938  0    1 20:12 ?        00:00:00 newave170502_L
sgeadmin  2939  2935  2939  0    1 20:12 ?        00:00:00 newave170502_L
sgeadmin  2940  2935  2940  0    1 20:12 ?        00:00:00 newave170502_L
sgeadmin  2941  2935  2941  0    1 20:12 ?        00:00:00 newave170502_L
sgeadmin  2942  2935  2942  0    1 20:12 ?        00:00:00 newave170502_L
sgeadmin  2943  2935  2943  0    1 20:12 ?        00:00:00 newave170502_L
sgeadmin  2944  2935  2944  0    1 20:12 ?        00:00:00 newave170502_L
sgeadmin  2949  2491  2949  0    1 20:14 pts/0    00:00:00 ps -eLf

*** Again, as you can see, the scheduler filled up the first server and used
no slots on the second, but MPI ran 8 processes on the first server and tried
to start 8 more on the other one, which failed with an error.
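
One thing I still want to try (just an idea based on the build details above,
not something I have verified) is telling Hydra explicitly to take both the
allocation and the launcher from SGE:

$ qsub -N $nameofthecase -b y -pe orte 16 -cwd \
    mpiexec -rmk sge -launcher sge -np 16 newave170502_L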

Comments?


All the best, and thank you so much for your time and effort helping with
this one.


Sergio


On Sat, Jul 27, 2013 at 3:58 PM, Reuti <[email protected]> wrote:

> On 27.07.2013 at 16:25, Sergio Mafra wrote:
>
> > Reuti,
> >
> > Aggregating all data...
> >
> > My cluster has 2 servers (master and node001), with 16 slots each one.
> >
> > My mpi app is newave170502_L
> >
> > I ran 3 tests:
> >
> > 1. $round_robin using 32 slots: (ran ok)
> >
> >  2382 ?        Sl     0:00 /opt/sge6/bin/linux-x64/sge_execd
> >  2817 ?        S      0:00  \_ sge_shepherd-1 -bg
> >  2819 ?        Ss     0:00      \_ mpiexec newave170502_L
> >  2820 ?        S      0:00          \_ /usr/bin/hydra_pmi_proxy
> --control-port master:40945 --demux poll --pgid 0 --retries 10 --proxy-id 0
> >  2822 ?        R      0:30          |   \_ newave170502_L
> >  2821 ?        Sl     0:00          \_ /opt/sge6/bin/linux-x64/qrsh
> -inherit -V node001 "/usr/bin/hydra_pmi_proxy" --control-port master:40945
> --demux poll --pgid 0 --ret
>
> As both nodes are used, this will succeed. I wonder why there is only one
> `newave170502` process. It should show 16 on each machine as child of the
> particular `hydra_pmi_proxy`.
>
> What is the output of:
>
> mpiexec --version
>
> Maybe the application is using threads in addition. Does:
>
> ps -eLf
>
> list more instances of the application?
>
>
> > 2. $fill_up with 16 slots: (aborted with error: executing task of
> job 2 failed: execution daemon on host "node001" didn't accept task)
> >
> >  2842 ?        S      0:00  \_ sge_shepherd-2 -bg
> >  2844 ?        Ss     0:00      \_ mpiexec newave170502_L
> >  2845 ?        S      0:00          \_ /usr/bin/hydra_pmi_proxy
> --control-port master:45562 --demux poll --pgid 0 --retries 10 --proxy-id 0
> >  2847 ?        S      0:00          |   \_ newave170502_L
> >  2846 ?        Z      0:00          \_ [qrsh] <defunct>
>
> SGE allocated all slots to the "master" and none to "node001", as the
> submitted job can get the required amount of slots from only one machine,
> there is no need to spread another task on "node001". The question is: why
> is your application (or even the `mpiexec`) trying to do so? There were
> cases, where SGE was misled due to contradictory entries in:
>
> /etc/hosts
>
> having two or more different names for each machine.
>
> - What is the content of this file in your machines?
>
> - Is
>
> > 3. $fill_up with 18 slots (ran ok):
> >
> >  2382 ?        Sl     0:01 /opt/sge6/bin/linux-x64/sge_execd
> >  2861 ?        Sl     0:00  \_ sge_shepherd-3 -bg
> >  2862 ?        Ss     0:00      \_
> /opt/sge6/utilbin/linux-x64/qrsh_starter
> /opt/sge6/default/spool/exec_spool_local/master/active_jobs/3.1/1.master
> >  2869 ?        S      0:00          \_ /usr/bin/hydra_pmi_proxy
> --control-port node001:36673 --demux poll --pgid 0 --retries 10 --proxy-id 0
> >  2870 ?        R      0:24              \_ newave170502_L
>
> While in former times (with the old MPICH(1)) each slave task needed its own
> `qrsh --inherit ...`, nowadays only one is used and all additional
> processes on the master or any slave node are forks.
>
> I guess even 17 would work, as it would need at least one slot from the
> other machine.
>
> - Is there any comment in the output of your application, how many
> processes were started for a computation?
>
> - Is the `mpiexec` a plain binary, or some kind of wrapper script?
>
> file `which mpiexec`
>
> If it's a symbolic link, it should point to mpiexec.hydra and the inquiry
> can be repeated.
>
> -- Reuti
>
>
> > ---------- Forwarded message ----------
> > From: Sergio Mafra <[email protected]>
> > Date: Sat, Jul 27, 2013 at 11:07 AM
> > Subject: Fwd: [gridengine users] Round Robin x Fill Up
> > To: Reuti <[email protected]>, "[email protected]" <
> [email protected]>
> >
> >
> > Appending to previous message.
> >
> > If I change to $fill_up and submit the same job using only 16 of the
> 32 available slots, here comes the output:
> >
> >  2842 ?        S      0:00  \_ sge_shepherd-2 -bg
> >  2844 ?        Ss     0:00      \_ mpiexec newave170502_L
> >  2845 ?        S      0:00          \_ /usr/bin/hydra_pmi_proxy
> --control-port master:45562 --demux poll --pgid 0 --retries 10 --proxy-id 0
> >  2847 ?        S      0:00          |   \_ newave170502_L
> >  2846 ?        Z      0:00          \_ [qrsh] <defunct>
> > ---------- Forwarded message ----------
> > From: Sergio Mafra <[email protected]>
> > Date: Sat, Jul 27, 2013 at 10:58 AM
> > Subject: Re: [gridengine users] Round Robin x Fill Up
> > To: Reuti <[email protected]>
> > Cc: "[email protected]" <[email protected]>
> >
> >
> > Hi Reuti,
> >
> > >Do you start in your job script any `mpiexec` resp. `mpirun` or is this
> issued already inside the application you started? The question is,
> whether there is any additional "-hostlist", "-machinefile" or alike given
> as argument to this command and invalidating the generated $PE_HOSTFILE of
> SGE.
> >
> > The job is started using mpiexec, in this way:
> > $ qsub -N $nameofthecase -b y -pe orte $1 -cwd mpiexec newave170502_L
> > where newave170502_L is the name of the MPI app.
> >
> > >You can also try the following:
> > >
> > >- revert the PE definition to allocate by $round_robin
> > >- submit a job
> > >- SSH to the master node of the parallel job
> > >- issue:
> > >
> > >ps -e f --cols=500
> > >
> > >(f w/o -)
> >
> > >- somewhere should be the `mpiexec` resp. `mpirun` command. Can you
> please post this line, it should be a child of the started job script.
> >
> > Here comes the output:
> >
> > 2382 ?        Sl     0:00 /opt/sge6/bin/linux-x64/sge_execd
> >  2817 ?        S      0:00  \_ sge_shepherd-1 -bg
> >  2819 ?        Ss     0:00      \_ mpiexec newave170502_L
> >  2820 ?        S      0:00          \_ /usr/bin/hydra_pmi_proxy
> --control-port master:40945 --demux poll --pgid 0 --retries 10 --proxy-id 0
> >  2822 ?        R      0:30          |   \_ newave170502_L
> >  2821 ?        Sl     0:00          \_ /opt/sge6/bin/linux-x64/qrsh
> -inherit -V node001 "/usr/bin/hydra_pmi_proxy" --control-port master:40945
> --demux poll --pgid 0 --retries 10 --proxy-id 1
> >
> > All best,
> >
> > Sergio
> >
> >
> > On Sat, Jul 27, 2013 at 10:13 AM, Reuti <[email protected]>
> wrote:
> > Hi,
> >
> > On 26.07.2013 at 23:26, Sergio Mafra wrote:
> >
> > > Hi Reuti,
> > >
> > > Thanks for your prompt answer.
> > > Regarding your questions:
> > >
> > > > How does your application read the list of granted machines?
> > > > Did you compile MPI on your own (which implementation in detail)?
> > >
> > > I've got no control over, or documentation for, this app. It was designed
> by an Electrical Research Center for our purposes.
> > >
> > > > PS: I assume that with $round_robin simply all (or at least: many)
> nodes were allowed access to.
> > >
> > > Yes, it's correct.
> > >
> > > >As now hosts are first filled before access to another one is
> granted, you might see the effect of the former (possibly wrong)
> distribution of slave tasks to the nodes
> > >
> > > So I understand that the app should be recompiled to take advantage
> of the $fill_up option?
> >
> > Not necessarily; the version of MPI in use is obviously prepared to run
> under the control of SGE, as it uses `qrsh -inherit ...` to start slave
> tasks on other nodes. Unfortunately it does so also on machines/slots which
> weren't granted to this job, which results in the error you mentioned first.
> >
> > Do you start in your job script any `mpiexec` resp. `mpirun` or is this
> issued already inside the application you started? The question is, whether
> there is any additional "-hostlist", "-machinefile" or alike given as
> argument to this command and invalidating the generated $PE_HOSTFILE of SGE.
> >
> > The MPI library should detect the granted allocation automatically, as
> it honors already that it's started under SGE.
> >
> > You can also try the following:
> >
> > - revert the PE definition to allocate by $round_robin
> > - submit a job
> > - SSH to the master node of the parallel job
> > - issue:
> >
> > ps -e f --cols=500
> >
> > (f w/o -)
> >
> > - somewhere should be the `mpiexec` resp. `mpirun` command. Can you
> please post this line, it should be a child of the started job script.
> >
> > -- Reuti
> >
> >
> > > All the best,
> > >
> > > Sergio
> > >
> > >
> > > On Fri, Jul 26, 2013 at 10:06 AM, Reuti <[email protected]>
> wrote:
> > > Hi,
> > >
> > > On 26.07.2013 at 14:22, Sergio Mafra wrote:
> > >
> > > > I'm using MIT StarCluster with mpich2 and OGE. Everything's ok.
> > > > But when I tried to change the strategy of distribution of work from
> Round Robin (default) to Fill Up, my problems began.
> > > > OGE keeps telling me that some nodes cannot accept tasks...
> > >
> > > On the one hand this is a good sign, as it confirms that your PE is
> defined to control slave tasks on the nodes.
> > >
> > >
> > > > "Error: executing task of job 9 failed: execution daemon on host
> "node002" didn't accept task"It seems that my mpi app always tries to run
> in all nodes of the cluster, no matter if OGE doesn't allow it to do it.
> > > > Does anybody knows of a workaround ?
> > >
> > > This indicates that your application tries to use a node in the
> cluster which wasn't granted to this job by SGE.
> > >
> > > How does your application read the list of granted machines?
> > >
> > > Did you compile MPI on your own (which implementation in detail)?
> > >
> > > -- Reuti
> > >
> > > PS: I assume that with $round_robin simply all (or at least: many)
> nodes were allowed access to. As now hosts are first filled before access
> to another one is granted, you might see the effect of the former (possibly
> wrong) distribution of slave tasks to the nodes.
> > >
> >
> >
> >
> >
>
>