Hi Reuti,

> Do you start in your job script any `mpiexec` resp. `mpirun`, or is this
> issued already inside the application you started? The question is
> whether there is any additional "-hostlist", "-machinefile" or similar
> argument given to this command, invalidating the generated $PE_HOSTFILE
> of SGE.
The job is started using mpiexec, in this way:

    $ qsub -N $nameofthecase -b y -pe orte $1 -cwd mpiexec newave170502_L

where newave170502_L is the name of the MPI app.

> You can also try the following:
>
> - revert the PE definition to allocate by $round_robin
> - submit a job
> - SSH to the master node of the parallel job
> - issue:
>
>     ps -e f --cols=500
>
> (f w/o -)
>
> - somewhere should be the `mpiexec` resp. `mpirun` command. Can you
>   please post this line? It should be a child of the started job script.

Here comes the output:

 2382 ?  Sl  0:00 /opt/sge6/bin/linux-x64/sge_execd
 2817 ?  S   0:00  \_ sge_shepherd-1 -bg
 2819 ?  Ss  0:00      \_ mpiexec newave170502_L
 2820 ?  S   0:00          \_ /usr/bin/hydra_pmi_proxy --control-port master:40945 --demux poll --pgid 0 --retries 10 --proxy-id 0
 2822 ?  R   0:30          |   \_ newave170502_L
 2821 ?  Sl  0:00          \_ /opt/sge6/bin/linux-x64/qrsh -inherit -V node001 "/usr/bin/hydra_pmi_proxy" --control-port master:40945 --demux poll --pgid 0 --retries 10 --proxy-id 1

All the best,

Sergio


On Sat, Jul 27, 2013 at 10:13 AM, Reuti <[email protected]> wrote:
> Hi,
>
> On 26.07.2013, at 23:26, Sergio Mafra wrote:
>
> > Hi Reuti,
> >
> > Thanks for your prompt answer.
> > Regarding your questions:
> >
> > > How does your application read the list of granted machines?
> > > Did you compile MPI on your own (which implementation in detail)?
> >
> > I've got no control over, and no documentation about, this app. It was
> > designed by an Electrical Research Center for our purposes.
> >
> > > PS: I assume that with $round_robin simply all (or at least: many)
> > > nodes were allowed access.
> >
> > Yes, that's correct.
> >
> > > As now hosts are first filled before access to another one is
> > > granted, you might see the effect of the former (possibly wrong)
> > > distribution of slave tasks to the nodes.
> >
> > So I understand that the app should be recompiled to take advantage of
> > the $fill_up option?
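[Editor's note: the `ps` tree above shows the tight integration at work — Hydra's `hydra_pmi_proxy` is started on the remote node via `qrsh -inherit`, and the host list is derived from SGE's $PE_HOSTFILE rather than from a user-supplied machinefile. As a rough illustration (not from the thread; hostnames, slot counts, and the file name are made up), this is the kind of `host:slots` list a tight integration derives from a granted allocation:]

```shell
# Hypothetical sample of a $PE_HOSTFILE as SGE writes it:
# one line per granted host: hostname, slots, queue, processor range.
cat > pe_hostfile.sample <<'EOF'
master 2 all.q@master UNDEFINED
node001 2 all.q@node001 UNDEFINED
EOF

# Derive a Hydra-style "host:slots" machine list from it.
awk '{print $1":"$2}' pe_hostfile.sample
```

If the application (or a wrapper inside it) passed its own machinefile to `mpiexec` instead, it would bypass this granted list entirely — which would explain tasks landing on non-granted nodes.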
> Not necessarily. The used version of MPI is obviously prepared to run
> under the control of SGE, as it uses `qrsh -inherit ...` to start slave
> tasks on other nodes. Unfortunately it does so also on machines/slots
> which weren't granted to this job, which results in the error you
> mentioned first.
>
> Do you start in your job script any `mpiexec` resp. `mpirun`, or is this
> issued already inside the application you started? The question is
> whether there is any additional "-hostlist", "-machinefile" or similar
> argument given to this command, invalidating the generated $PE_HOSTFILE
> of SGE.
>
> The MPI library should detect the granted allocation automatically, as
> it already honors that it's started under SGE.
>
> You can also try the following:
>
> - revert the PE definition to allocate by $round_robin
> - submit a job
> - SSH to the master node of the parallel job
> - issue:
>
>     ps -e f --cols=500
>
> (f w/o -)
>
> - somewhere should be the `mpiexec` resp. `mpirun` command. Can you
>   please post this line? It should be a child of the started job script.
>
> -- Reuti
>
> > All the best,
> >
> > Sergio
> >
> > On Fri, Jul 26, 2013 at 10:06 AM, Reuti <[email protected]> wrote:
> > Hi,
> >
> > On 26.07.2013, at 14:22, Sergio Mafra wrote:
> >
> > > I'm using MIT StarCluster with mpich2 and OGE. Everything's OK.
> > > But when I tried to change the strategy of distribution of work from
> > > Round Robin (default) to Fill Up, my problems began.
> > > OGE keeps telling me that some nodes can not receive tasks...
> >
> > On the one hand this is a good sign, as it confirms that your PE is
> > defined to control slave tasks on the nodes.
> >
> > > "Error: executing task of job 9 failed: execution daemon on host
> > > "node002" didn't accept task"
> > > It seems that my MPI app always tries to run on all nodes of the
> > > cluster, no matter whether OGE allows it to do so.
> > > Does anybody know of a workaround?
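[Editor's note: the switch between the two strategies lives in the PE definition, not in the application. A hypothetical `orte` PE as `qconf -sp orte` might show it (field values assumed; only `allocation_rule` differs between the two setups discussed in the thread):]

```
pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
```

Editing with `qconf -mp orte` and changing `allocation_rule` to `$fill_up` is all that "Fill Up" requires on the SGE side; `control_slaves TRUE` is what makes `qrsh -inherit` (and hence the "didn't accept task" enforcement) possible.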
> > This indicates that your application tries to use a node in the
> > cluster which wasn't granted to this job by SGE.
> >
> > How does your application read the list of granted machines?
> >
> > Did you compile MPI on your own (which implementation in detail)?
> >
> > -- Reuti
> >
> > PS: I assume that with $round_robin simply all (or at least: many)
> > nodes were allowed access. As now hosts are first filled before access
> > to another one is granted, you might see the effect of the former
> > (possibly wrong) distribution of slave tasks to the nodes.
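[Editor's note: one way to answer "how does the application read the list of granted machines" is to log the allocation from inside the job itself. A hypothetical job script sketch (the `orte`/`newave170502_L` names are from the thread; the `:-/dev/null` guard is added only so the snippet also runs outside SGE for illustration):]

```shell
#!/bin/sh
#$ -pe orte 4
#$ -cwd
# Log what SGE actually granted, so a "didn't accept task" error can be
# compared against the real allocation afterwards.
echo "Granted allocation (from \$PE_HOSTFILE):"
cat "${PE_HOSTFILE:-/dev/null}"
# Launch as before -- with no -machinefile/-f flag, so Hydra keeps using
# the SGE-provided allocation:
# mpiexec newave170502_L
```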
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
