Re: [gridengine users] Lost qrsh jobs

Reuti Wed, 21 Nov 2012 09:01:22 -0800

On 21.11.2012 at 17:28, François-Michel L'Heureux wrote:

> Hi!
> 
> Thanks for the reply.
> 
> No, the job did not run. My launch command sets the verbose flag and
> -now no. The first thing I get is:
> waiting for interactive job to be scheduled ...


Yep, with -now n it will wait until resources are available. The default 
behavior would be to fail more or less instantly.
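
To illustrate the difference, here is a minimal Python sketch of a wrapper around qrsh. It only uses the two flags mentioned in this thread (-verbose and -now). The function names and the timeout value are hypothetical and chosen for illustration; because -now n may wait indefinitely, the wrapper bounds the wait on the client side.

```python
import subprocess


def build_qrsh_command(command, wait_for_resources=True, verbose=True):
    """Return the argv list for a qrsh call.

    Only the flags mentioned in this thread are used: -verbose and -now.
    """
    argv = ["qrsh"]
    if verbose:
        argv.append("-verbose")
    # "-now n" keeps the request pending until resources are free;
    # "-now y" (the qrsh default) gives up almost immediately.
    argv += ["-now", "n" if wait_for_resources else "y"]
    argv += list(command)
    return argv


def run_qrsh(command, timeout_seconds=600):
    """Run qrsh and return its exit code, or None if the timeout expired.

    The timeout is an arbitrary assumption, not an SGE setting.
    """
    try:
        result = subprocess.run(build_qrsh_command(command), timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the local qrsh client before raising.
        return None
    return result.returncode
```

A None result only means that the local qrsh client was stopped; a request that is still pending in the queue would presumably have to be removed separately (e.g. with qdel).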


> Which is good. Then nothing happens. Later, when I kill the jobs, I see a mix 
> of the following messages popping up in my logs:
> Your "qrsh" request could not be scheduled, try again later.
> and
> error: commlib error: got select error (No route to host) 
> and

Is there a route to the host?
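
One way to check this from the master node is a plain TCP connection attempt to the execution host. The following Python sketch is only an illustration: the function name check_route is made up, and the port to probe is site-specific (it has to be the port sge_execd listens on in your installation).

```python
import socket


def check_route(host, port, timeout_seconds=5.0):
    """Try a TCP connection and report "ok" or the reason it failed."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return "ok"
    except socket.gaierror as exc:
        # The host name did not resolve at all.
        return f"name lookup failed: {exc}"
    except OSError as exc:
        # Covers refused connections, timeouts and "No route to host".
        return f"connect failed: {exc}"
```

Calling it for every execution host right before submitting would show whether the commlib errors coincide with nodes that have just been removed or re-added.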


> error: commlib error: got select error (Connection timed out) 
> 
> It's strange that this is only received after the kill.
> 
> From my terminal experience, qrsh can behave in a weird manner. When I get an 
> error message, the qrsh job is queued (and shown in qstat), but I lose my 
> handle on it.
> 
> Regarding the dynamic cluster, my IPs are static for the lifetime of a 
> node. Nodes can be added and removed. Their IPs won't change in the middle of 
> a run. But say that node3 is added with an IP, then removed, then added back: 
> the IP will not be the same. Could that be the cause?

For SGE it would then be a different node with a different name. What's the 
reason for adding and removing nodes?

-- Reuti


> Thanks
> Mich
> 
> 
> On Wed, Nov 21, 2012 at 10:55 AM, Reuti <[email protected]> wrote:
> Hi,
> 
> On 21.11.2012 at 16:10, François-Michel L'Heureux wrote:
> 
> > I have an issue where some jobs I launch with the qrsh command never appear 
> > in the queue. If I run the command "ps -ef | grep qrsh" I can see them. 
> > My setup
> 
> Ok, but did it ever start on any node?
> 
> 
> > is as follows:
> >
> >       • I just have one process calling the grid engine via qrsh. This 
> > process resides on the master node.
> >       • I don't use nfs, I use sshfs instead.
> >       • I run on a dynamic cluster, which means that at any time nodes can 
> > be added or removed.
> > Does anyone have an idea what could cause the issue? I can work around it by 
> > looking at the process list when the queue is empty and 
> > killing/rescheduling the processes running a qrsh command, but I would rather 
> > prevent it.
> 
> What do you mean by "dynamic cluster"? SGE needs fixed addresses per node.
> 
> -- Reuti
> 
> 
> > Thanks
> > Mich
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
> 
> 

