On 21.11.2012 at 19:44, François-Michel L'Heureux wrote:

> On Wed, Nov 21, 2012 at 12:14 PM, Reuti <[email protected]> wrote:
> On 21.11.2012 at 18:06, François-Michel L'Heureux wrote:
>
> > On Wed, Nov 21, 2012 at 11:59 AM, Reuti <[email protected]> wrote:
> > On 21.11.2012 at 17:28, François-Michel L'Heureux wrote:
> >
> > > Hi!
> > >
> > > Thanks for the reply.
> > >
> > > No, the job did not run. My launch command sets the verbose flag and
> > > -now no. The first thing I get is
> > > waiting for interactive job to be scheduled ...
> >
> > Yep, with -now n it will wait until resources are available. The default
> > behavior would be to fail more or less instantly.
> >
> >
> > > Which is good. Then nothing happens. Later, when I kill the jobs, I see a
> > > mix of some
> > > Your "qrsh" request could not be scheduled, try again later. popping in
> > > my logs.
> > > and
> > > error: commlib error: got select error (No route to host)
> > > and
> >
> > Is there a route to the host?
> > Yes
> >
> >
> > > error: commlib error: got select error (Connection timed out)
> > >
> > > It's strange that this is only received after the kill.
> > >
> > > From my terminal experience, qrsh can behave in a weird manner. When I
> > > get an error message, the qrsh job is queued (and shown in qstat), but I
> > > lose my handle over it.
> > >
> > > Regarding the dynamic cluster, my IPs are static for the duration of a
> > > node life. Nodes can be added and removed. Their IPs won't change in the
> > > middle of a run. But say that node3 is added with an IP, then removed,
> > > then added back, the IP will not be the same. Might it be the cause?
> >
> > For SGE it would be a different node then with a different name. What's the
> > reason for adding and removing nodes?
> > We are working over Amazon with spot instances. We add/remove nodes based on
> > the queue size and other factors.
> >
> > -- Reuti
> >
> > I'm onto something. When a job fails and the status is set to "Eqw", does
> > it stay in the qstat output forever or does it get removed at some point?
> > If they go away, that would explain the issue.
>
> It will stay in Eqw until you either delete the job or clear the flag with
> `qmod -cj <jobid>`.
>
>
> > Also, in case it gives you any hint, when I run
> > qacct -j | grep failed
> >
> > I can see the following failures
> > 100 : assumedly after job
>
> This means the job is to be set into the error state. If it exits exactly
> with 100 => error state, and 99 => reschedule the job.
> Is it intended that the job script exits with this error code? You will get
> more than one entry in the accounting file when the job is rerun.
> I don't understand what you mean there. I have control over this? My tests
> show that if I call "kill -9" on the process, that's what happens, but in
> qacct -j it appears more often than I killed jobs. What else can cause it?

Any epilog exiting with 100?

-- Reuti

> > 37 : qmaster enforced h_rt limit
>
> Well, if h_rt is exceeded it's no wonder that it's killed and as a result
> qrsh lost contact as the process on the node is killed, not the `qrsh` on the
> login machine.
> Ok this one comes from when an execution node goes away, the job is deleted
> with qdel and this becomes the failed code.
>
> -- Reuti
>
> I'm trying to reproduce the issue any way I can think of. My best lead was if
> Eqw disappears after a while but if it doesn't, I have to look somewhere
> else.
>
>
> > > Thanks
> > > Mich
> > >
> > >
> > > On Wed, Nov 21, 2012 at 10:55 AM, Reuti <[email protected]>
> > > wrote:
> > > Hi,
> > >
> > > On 21.11.2012 at 16:10, François-Michel L'Heureux wrote:
> > >
> > > > I have an issue where some jobs I call with the qrsh commands never
> > > > appear in the queue. If I run the command "ps -ef | grep qrsh" I can
> > > > see them. My setup
> > >
> > > Ok, but did it ever start on any node?
> > >
> > >
> > > > is as follows:
> > > >
> > > > • I just have one process calling the grid engine via qrsh. This
> > > > process resides on the master node.
> > > > • I don't use nfs, I use sshfs instead.
> > > > • I run over a dynamic cluster, which means that at any time nodes
> > > > can be added or removed.
> > > > Does anyone have an idea what can cause the issue? I can counter it
> > > > by looking at the process list when the queue is empty and
> > > > killing/rescheduling those running a qrsh command, but I would rather
> > > > prevent it.
> > >
> > > What do you mean by "dynamic cluster"? SGE needs fixed addresses per node.
> > >
> > > -- Reuti
> > >
> > >
> > > > Thanks
> > > > Mich
> > > > _______________________________________________
> > > > users mailing list
> > > > [email protected]
> > > > https://gridengine.org/mailman/listinfo/users
> > > >

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
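
To illustrate the exit-status convention mentioned in the thread (a job exiting with exactly 100 is put into the error state, 99 asks for a reschedule), a job script could guard against returning those two values by accident. This is only a minimal sketch under assumptions: `classify_exit` and `run_payload` are hypothetical helper names, not SGE commands, and remapping to 1 is just one possible choice.

```shell
#!/bin/sh
# Sketch only: both helpers below are hypothetical, not part of SGE.

# Restates the convention from the thread: 100 => error state,
# 99 => reschedule. Any other status is treated here as a plain finish.
classify_exit() {
    case "$1" in
        100) echo "error" ;;
        99)  echo "reschedule" ;;
        *)   echo "finished" ;;
    esac
}

# Runs the given command and remaps an accidental 99 or 100 to 1
# (an arbitrary choice), so that only a deliberate exit from the job
# script would trigger the special handling.
run_payload() {
    "$@"
    status=$?
    if [ "$status" -eq 99 ] || [ "$status" -eq 100 ]; then
        echo "payload exited with $status, remapping to 1" >&2
        return 1
    fi
    return "$status"
}
```

Inside a job script this might be used as `run_payload ./my_program` (where `my_program` is a placeholder), followed by an explicit `exit 100` or `exit 99` only where that behavior is actually wanted. It does not cover an epilog that itself exits with 100.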
