On 21.11.2012, at 18:06, François-Michel L'Heureux wrote:

> On Wed, Nov 21, 2012 at 11:59 AM, Reuti <[email protected]> wrote:
> On 21.11.2012, at 17:28, François-Michel L'Heureux wrote:
> 
> > Hi!
> >
> > Thanks for the reply.
> >
> > No, the job did not run. My launch command sets the verbose flag and -now no. 
> > The first thing I get is
> > waiting for interactive job to be scheduled ...
> 
> Yep, with -now n it will wait until resources are available. The default 
> behavior (-now y) would be to fail more or less instantly.
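> 
> A minimal sketch of the two modes (the command here is just a placeholder):
> 
>     qrsh -verbose -now n sleep 60   # queue the request and wait for resources
>     qrsh -verbose -now y sleep 60   # qrsh's default: fail right away if nothing is free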
> 
> 
> > Which is good. Then nothing happens. Later, when I kill the jobs, I see a 
> > mix of messages popping up in my logs:
> > Your "qrsh" request could not be scheduled, try again later.
> > and
> > error: commlib error: got select error (No route to host)
> > and
> 
> Is there a route to the host?
> Yes 
> 
> 
> > error: commlib error: got select error (Connection timed out)
> >
> > It's strange that this is only received after the kill.
> >
> > From my experience at the terminal, qrsh can behave in a weird manner. When I 
> > get an error message, the qrsh job is queued (and shown in qstat), but I lose 
> > my handle on it.
> >
> > Regarding the dynamic cluster, my IPs are static for the duration of a node's 
> > life. Nodes can be added and removed, but their IPs won't change in the middle 
> > of a run. However, if node3 is added with an IP, then removed, then added 
> > back, the IP will not be the same. Could that be the cause?
> 
> For SGE it would then be a different node with a different name. What's the 
> reason for adding and removing nodes?
> We are working on Amazon with spot instances. We add/remove nodes based on 
> the queue size and other factors. 
> 
> -- Reuti
> 
> I'm onto something. When a job fails and its status is set to "Eqw", does it 
> stay in the qstat output forever, or does it get removed at some point? If 
> such jobs go away on their own, that would explain the issue.

It will stay in Eqw until you either delete the job or clear the flag with 
`qmod -cj <jobid>`.
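
A minimal sketch for inspecting and clearing such jobs (<jobid> is a placeholder):

    qstat | grep Eqw     # list jobs stuck in the error state
    qstat -j <jobid>     # the "error reason" line shows why the job failed
    qmod -cj <jobid>     # clear the error state so the job can be rescheduled
    qdel <jobid>         # or delete the job entirely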

> 
> Also, in case it gives you any hint, when I run 
> qacct -j | grep failed
> 
> I see the following failures:
> 100 : assumedly after job

An exit code of 100 sets the job into the error state. Is the job script 
intentionally exiting with this code? Note that you will get more than one entry 
in the accounting file when a job is rerun.
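
A minimal sketch of a job script that would produce this code (the command is a 
placeholder):

    #!/bin/sh
    # In SGE, exit code 100 puts the job into the error state (Eqw);
    # exit code 99 requests that the job be rerun instead.
    my_command || exit 100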


> 37  : qmaster enforced h_rt limit

Well, if h_rt is exceeded it's no wonder that the job is killed. As a result, 
qrsh loses contact, since the process on the node is killed but not the `qrsh` 
on the login machine.
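
A minimal sketch for checking the limit and the job's actual runtime (queue name, 
value and <jobid> are placeholders):

    qconf -sq all.q | grep h_rt      # show the queue's hard runtime limit
    qrsh -l h_rt=2:0:0 mycommand     # request a specific runtime at submission
    qacct -j <jobid>                 # ru_wallclock shows how long the job ran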

-- Reuti



> > Thanks
> > Mich
> >
> >
> > On Wed, Nov 21, 2012 at 10:55 AM, Reuti <[email protected]> wrote:
> > Hi,
> >
> > On 21.11.2012, at 16:10, François-Michel L'Heureux wrote:
> >
> > > I have an issue where some jobs I call with the qrsh command never 
> > > appear in the queue. If I run the command "ps -ef | grep qrsh" I can 
> > > see them. My setup
> >
> > Ok, but did it ever start on any node?
> >
> >
> > > is as follows:
> > >
> > >       • I have just one process calling the grid engine via qrsh. This 
> > > process resides on the master node.
> > >       • I don't use NFS; I use sshfs instead.
> > >       • I run over a dynamic cluster, which means that at any time nodes 
> > > can be added or removed.
> > > Does anyone have an idea what could cause the issue? I can work around it by 
> > > looking at the process list when the queue is empty and 
> > > killing/rescheduling the processes running a qrsh command, but I would rather 
> > > prevent it.
> >
> > What do you mean by "dynamic cluster"? SGE needs a fixed address per node.
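> >
> > A minimal sketch of how nodes are typically registered and removed in SGE 
> > (the node name is a placeholder):
> >
> >     qconf -ah node3      # register node3 as an administrative host
> >     qconf -ae            # add an execution host (opens an editor template)
> >     qconf -de node3      # remove the execution host again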
> >
> > -- Reuti
> >
> >
> > > Thanks
> > > Mich
> >
> >
> 
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
