Re: [gridengine users] Lost qrsh jobs

Reuti Thu, 22 Nov 2012 07:24:18 -0800

On 21.11.2012 at 19:44, François-Michel L'Heureux wrote:

> On Wed, Nov 21, 2012 at 12:14 PM, Reuti <[email protected]> wrote:
> On 21.11.2012 at 18:06, François-Michel L'Heureux wrote:
> 
> > On Wed, Nov 21, 2012 at 11:59 AM, Reuti <[email protected]> wrote:
> > On 21.11.2012 at 17:28, François-Michel L'Heureux wrote:
> >
> > > Hi!
> > >
> > > Thanks for the reply.
> > >
> > > No, the job did not run. My launch command sets the verbose flag and 
> > > -now no. The first thing I get is
> > > waiting for interactive job to be scheduled ...
> >
> > Yep, with -now n it will wait until resources are available. The default 
> > behavior would be to fail more or less instantly.
> >
> >
> > > Which is good. Then nothing happens. Later, when I kill the jobs, I see a 
> > > mix of some
> > > Your "qrsh" request could not be scheduled, try again later.
> > > popping up in my logs,
> > > and
> > > error: commlib error: got select error (No route to host)
> > > and
> >
> > Is there a route to the host?
> > Yes
> >
> >
> > > error: commlib error: got select error (Connection timed out)
> > >
> > > It's strange that this is only received after the kill.
> > >
> > > From my terminal experience, qrsh can behave in a weird manner. When I 
> > > get an error message, the qrsh job is queued (and shown in qstat), but I 
> > > lose my handle on it.
> > >
> > > Regarding the dynamic cluster, my IPs are static for the duration of a 
> > > node's life. Nodes can be added and removed. Their IPs won't change in the 
> > > middle of a run. But say that node3 is added with an IP, then removed, 
> > > then added back: the IP will not be the same. Could that be the cause?
> >
> > For SGE it would be a different node then with a different name. What's the 
> > reason for adding and removing nodes?
> > We are working on Amazon with spot instances. We add/remove nodes based on 
> > the queue size and other factors.
> >
> > -- Reuti
> >
> > I'm onto something. When a job fails and the status is set to "Eqw", does 
> > it stay in the qstat output indefinitely or does it get removed at some 
> > point? If such jobs go away, that would explain the issue.
> 
> It will stay in Eqw until you either delete the job or clear the flag with 
> `qmod -cj <jobid>`. 
> 
> >
> > Also, in case it gives you any hint, when I run
> > qacct -j | grep failed
> >
> > I can see the following failures
> > 100 : assumedly after job
> 
> This code means the job is to be set into error state.


If it exits with exactly 100 => error state, and with exactly 99 => the job is rescheduled.
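
A minimal sketch of how a job script could use these two exit codes. The helper name `sge_exit_code` and the failure classes `ok`, `transient` and `fatal` are made up for illustration; only the values 99 and 100 come from the convention described above:

```shell
#!/bin/sh
# Hypothetical helper: map a failure class to the exit status SGE reacts to.
sge_exit_code() {
    case "$1" in
        ok)        echo 0 ;;    # normal completion
        transient) echo 99 ;;   # 99: the job is rescheduled
        fatal)     echo 100 ;;  # 100: the job is put into error state
        *)         echo 1 ;;    # any other failure: plain non-zero exit
    esac
}

# A job script might then end with something like:
#   exit "$(sge_exit_code transient)"
```

Any other non-zero status is just recorded as the job's exit status, so the script has to return exactly 99 or 100 to trigger either behavior.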


> Is it intended that the job script exits with this error code? You will get 
> more than one entry in the accounting file when the job is rerun.
> I don't understand what you mean there. Do I have control over this? My tests 
> show that if I call "kill -9" on the process, that's what happens, but in 
> qacct -j it appears more often than I killed jobs. What else can cause it?

Any epilog exiting with 100?
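
To compare how often each code shows up with the number of jobs killed by hand, the `failed` lines could be tallied roughly like this. The function name `tally_failed` is hypothetical, and the sketch assumes every accounting record has a line starting with `failed` followed by the numeric code, as in the `qacct -j | grep failed` output quoted in this thread:

```shell
#!/bin/sh
# Hypothetical helper: read `qacct -j` style output on stdin and print
# one "<failed code> <count>" line per distinct code, sorted by code.
tally_failed() {
    awk '$1 == "failed" { count[$2]++ }
         END { for (code in count) print code, count[code] }' | sort -n
}

# Possible usage on the master node:
#   qacct -j | tally_failed
```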

-- Reuti


> > 37  : qmaster enforced h_rt limit
> 
> Well, if h_rt is exceeded it's no wonder that the job is killed. As a result 
> qrsh loses contact, because the process on the node is killed, not the `qrsh` 
> on the login machine.
> OK, this one appears when an execution node goes away: the job is deleted 
> with qdel and this becomes the failed code.
> 
> -- Reuti
> 
> I'm trying to reproduce the issue any way I can think of. My best lead was 
> that Eqw jobs disappear after a while, but if they don't, I have to look 
> somewhere else.
> 
> 
> 
> > > Thanks
> > > Mich
> > >
> > >
> > > On Wed, Nov 21, 2012 at 10:55 AM, Reuti <[email protected]> 
> > > wrote:
> > > Hi,
> > >
> > > On 21.11.2012 at 16:10, François-Michel L'Heureux wrote:
> > >
> > > > I have an issue where some jobs I launch with the qrsh command never 
> > > > appear in the queue. If I run the command "ps -ef | grep qrsh" I can 
> > > > see them. My setup
> > >
> > > Ok, but did it ever start on any node?
> > >
> > >
> > > > is as follows:
> > > >
> > > >       • I just have one process calling the grid engine via qrsh. This 
> > > > process resides on the master node.
> > > >       • I don't use nfs, I use sshfs instead.
> > > >       • I run on a dynamic cluster, which means that at any time nodes 
> > > > can be added or removed.
> > > > Does anyone have an idea of what could cause the issue? I can work around 
> > > > it by looking at the process list when the queue is empty and 
> > > > killing/rescheduling the processes running a qrsh command, but I would 
> > > > rather prevent it.
> > >
> > > > What do you mean by "dynamic cluster"? SGE needs fixed addresses per node.
> > >
> > > -- Reuti
> > >
> > >
> > > > Thanks
> > > > Mich
> > > > _______________________________________________
> > > > users mailing list
> > > > [email protected]
> > > > https://gridengine.org/mailman/listinfo/users
> > >
> > >
> >
> >
> 
> 

