Re: [gridengine users] Lost qrsh jobs

François-Michel L'Heureux Wed, 21 Nov 2012 09:08:41 -0800

On Wed, Nov 21, 2012 at 11:59 AM, Reuti <[email protected]> wrote:


> On 21.11.2012 at 17:28, François-Michel L'Heureux wrote:
>
> > Hi!
> >
> > Thanks for the reply.
> >
> > No, the job did not run. My launch command sets the verbose flag and
> > -now no. The first thing I get is
> > waiting for interactive job to be scheduled ...
>
> Yep, with -now n it will wait until resources are available. The default
> behavior would be to fail more or less instantly.
>
>
> > Which is good. Then nothing happens. Later, when I kill the jobs, a mix
> > of the following messages pops up in my logs:
> > Your "qrsh" request could not be scheduled, try again later.
> > and
> > error: commlib error: got select error (No route to host)
> > and
>
> Is there a route to the host?
>
Yes

>
>
> > error: commlib error: got select error (Connection timed out)
> >
> > It's strange that this is only received after the kill.
> >
> > From my terminal experience, qrsh can behave in a weird manner. When I
> > get an error message, the qrsh job is queued (and shown in qstat), but I
> > lose my handle on it.
> >
> > Regarding the dynamic cluster, my IPs are static for the duration of a
> > node's life. Nodes can be added and removed. Their IPs won't change in the
> > middle of a run. But say that node3 is added with an IP, then removed, then
> > added back: the IP will not be the same. Might that be the cause?
>
> For SGE it would then be a different node with a different name. What's
> the reason for adding and removing nodes?
>
We are running on Amazon with spot instances. We add and remove nodes based on
the queue size and other factors.

>
> -- Reuti
>

I'm onto something. When a job fails and its status is set to "Eqw", does
it stay in the qstat output forever, or does it get removed at some point?
If such jobs go away, that would explain the issue.
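
To check this on my side, I could save the qstat output at intervals and look for jobs whose state contains "E". Below is a minimal sketch of that check, not a tested tool. It assumes the plain qstat listing, with the state in the fifth whitespace-separated column and job names without spaces. The function name and the sample listing are made up for illustration.

```python
def jobs_in_error(qstat_text):
    """Return the job IDs whose state column contains 'E' (for example Eqw)."""
    job_ids = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, and blank lines:
        # job rows are assumed to start with a numeric job ID.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        state = fields[4]  # assumption: state is the fifth column
        if "E" in state:
            job_ids.append(fields[0])
    return job_ids


# Made-up listing for illustration only, not real output from my cluster.
SAMPLE_QSTAT = """\
job-ID  prior   name       user         state submit/start at     queue          slots
--------------------------------------------------------------------------------------
    101 0.55500 myjob      mich         r     11/21/2012 09:00:00 all.q@node1        1
    102 0.55500 myjob      mich         Eqw   11/21/2012 09:01:00                    1
    103 0.00000 myjob      mich         qw    11/21/2012 09:02:00                    1
"""

if __name__ == "__main__":
    print(jobs_in_error(SAMPLE_QSTAT))
```

Comparing the returned IDs between two snapshots would show whether an errored job disappears from the listing on its own.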

Also, in case it gives you any hint, when I run
qacct -j | grep failed

I can see the following failures
100 : assumedly after job
37  : qmaster enforced h_rt limit
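
To see how often each failure occurs, the same lines can be tallied. This is a rough sketch that works on saved qacct -j output. It assumes each record has a line starting with "failed" followed by the code and reason, as in the lines above, and that successful jobs show a bare 0. The function name and the sample records are made up for illustration.

```python
from collections import Counter


def tally_failures(qacct_text):
    """Count non-zero 'failed' entries, keyed by their 'code : reason' text."""
    counts = Counter()
    for line in qacct_text.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2 or fields[0] != "failed":
            continue
        # Collapse repeated spaces so "37  : ..." and "37 : ..." match.
        reason = " ".join(fields[1].split())
        if reason != "0":  # assumption: a bare 0 means the job did not fail
            counts[reason] += 1
    return counts


# Made-up records for illustration only, not real accounting data.
SAMPLE_QACCT = """\
==============================================================
jobnumber    101
failed       0
==============================================================
jobnumber    102
failed       100 : assumedly after job
==============================================================
jobnumber    103
failed       37  : qmaster enforced h_rt limit
"""

if __name__ == "__main__":
    for reason, count in tally_failures(SAMPLE_QACCT).most_common():
        print(count, reason)
```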



>
> > Thanks
> > Mich
> >
> >
> > On Wed, Nov 21, 2012 at 10:55 AM, Reuti <[email protected]> wrote:
> > Hi,
> >
> > On 21.11.2012 at 16:10, François-Michel L'Heureux wrote:
> >
> > > I have an issue where some jobs I launch with the qrsh command never
> > > appear in the queue. If I run the command "ps -ef | grep qrsh" I can see
> > > them. My setup
> >
> > Ok, but did it ever start on any node?
> >
> >
> > > is as follows:
> > >
> > >       • I just have one process calling the grid engine via qrsh. This
> > >         process resides on the master node.
> > >       • I don't use nfs, I use sshfs instead.
> > >       • I run over a dynamic cluster, which means that at any time nodes
> > >         can be added or removed.
> > > Does anyone have an idea of what could cause the issue? I can work around it
> > > by looking at the process list when the queue is empty and
> > > killing/rescheduling the processes running a qrsh command, but I would rather
> > > prevent it.
> >
> > What do you mean by "dynamic cluster"? SGE needs fixed addresses per
> > node.
> >
> > -- Reuti
> >
> >
> > > Thanks
> > > Mich
> > > _______________________________________________
> > > users mailing list
> > > [email protected]
> > > https://gridengine.org/mailman/listinfo/users
> >
> >
>
>

