Re: [gridengine users] QRSH exiting with 0 but sets Error on queue

Ian Johnson Wed, 06 Mar 2013 03:51:23 -0800

Reuti,

Problem solved. It was a hostname lookup problem. Once all the hosts had correct hosts files, qrsh could connect to any slave from all submit hosts. Thank you very much for your help over the last few days.

Below is the trace from shepherd on an exec host. The hostname lookup failure was not reported in this logfile. Would this be something that could be added in a future release? If others encounter this problem, it would immediately identify the issue.
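
For anyone hitting the same symptom, a quick name-resolution check on each exec and submit host may save some time. The following is only a minimal sketch, not part of Grid Engine: check_hosts and the RESOLVER variable are hypothetical names, and it assumes getent is available (as on most Linux systems) and that it consults the same hosts file and name services as the Grid Engine daemons.

```shell
#!/bin/sh
# Hypothetical helper: report which of the given host names resolve locally.
# RESOLVER is an assumed override hook; it defaults to "getent hosts".
check_hosts() {
    failed=0
    for host in "$@"; do
        # The unquoted expansion is intentional so "getent hosts" splits into two words.
        if ${RESOLVER:-getent hosts} "$host" >/dev/null 2>&1; then
            echo "ok      $host"
        else
            echo "FAILED  $host"
            failed=1
        fi
    done
    # Non-zero return if at least one name did not resolve.
    return "$failed"
}
```

Running something like `check_hosts broker exec_1` on every host in the cluster would show which machines cannot resolve the submit host (broker in the trace below) or the exec host.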


Many thanks,

Ian

<shepherd-trace>
03/05/2013 05:54:02 [0:6757]: shepherd called with uid = 0, euid = 0
03/05/2013 05:54:02 [0:6757]: rlogin_daemon = builtin
03/05/2013 05:54:02 [0:6757]: starting up 2011.11
03/05/2013 05:54:02 [0:6757]: setpgid(6757, 6757) returned 0
03/05/2013 05:54:02 [0:6757]: do_core_binding: "binding" parameter not found in config file
03/05/2013 05:54:02 [0:6757]: no prolog script to start
03/05/2013 05:54:02 [0:6757]: pipe to child uses fds 3 and 4
03/05/2013 05:54:02 [0:6757]: calling fork_pty()
03/05/2013 05:54:02 [0:6757]: parent: forked "job" with pid 6758
03/05/2013 05:54:02 [0:6758]: child: closing parents end of the pipe
03/05/2013 05:54:02 [0:6758]: child: trying to read from parent through the pipe
03/05/2013 05:54:02 [0:6757]: parent: job-pid: 6758
03/05/2013 05:54:02 [0:6757]: parent: closing childs end of the pipe
03/05/2013 05:54:02 [0:6757]: csp = 0
03/05/2013 05:54:02 [0:6757]: parent: starting parent loop with remote_host = broker, remote_port = 38675, job_owner = root, fd_pty_master = 5, fd_pipe_in = -1, fd_pipe_out = -1, fd_pipe_err = -1, fd_pipe_to_child = 4
03/05/2013 05:54:02 [0:6757]: parent: opening connection to qrsh/qlogin client
03/05/2013 05:54:02 [0:6757]: parent: can't open commlib stream, err_msg =
03/05/2013 05:54:02 [0:6757]: startup of qrsh job failed:
03/05/2013 05:54:02 [0:6758]: child: error communicating with parent: 0, Success
03/05/2013 05:54:02 [0:6758]: failed starting job
03/05/2013 05:54:02 [0:6758]: no epilog script to start
03/05/2013 05:54:02 [0:6758]: writing exit status to qrsh: 0
03/05/2013 05:54:02 [0:6758]: sending UNREGISTER_CTRL_MSG with exit_status = "0"
03/05/2013 05:54:02 [0:6758]: sending to host: <null>
03/05/2013 05:54:02 [0:6758]: comm_write_message returned: can't find handle
03/05/2013 05:54:02 [0:6758]: close_parent_loop: comm_write_message() returned 0 instead of 1!!!
03/05/2013 05:54:02 [0:6758]: waiting for UNREGISTER_RESPONSE_CTRL_MSG
03/05/2013 05:54:02 [0:6758]: No connection or problem while waiting for message: 1
03/05/2013 05:54:02 [0:6758]: parent: cl_com_ignore_timeouts
03/05/2013 05:54:02 [0:6758]: parent: error in comm_cleanup_lib(): 3
03/05/2013 05:54:02 [0:6758]: parent: leaving closinge_parent_loop()
</shepherd-trace>

On Tue, 05 Mar 2013 09:39:39 -0000, Ian Johnson <[email protected]> wrote:

Reuti,
I was wondering about both: the exit status of qrsh being 0, and the error being set on the queue. The output of qacct is:
$ qacct -j 152
==============================================================
qname        all.q
hostname     exec_1
group        root
owner        root
project      NONE
department   defaultdepartment
jobname      QRLOGIN
jobnumber    152
taskid       undefined
account      sge
priority     0
qsub_time    Tue Mar  5 04:35:32 2013
start_time   -/-
end_time     -/-
granted_pe   NONE
slots        1
failed       11  : before job
exit_status  0
ru_wallclock 0
ru_utime     0.000
ru_stime     0.000
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    0
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     0
ru_nivcsw    0
cpu          0.000
mem          0.000
io           0.000
iow          0.000
maxvmem      0.000
arid         undefined
qrsh works, however, from the master host, which is both a submit and administration host, as is the host on which I ran the "failing" qrsh process.
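
The record above shows why exit_status alone is misleading here: failed is 11 while exit_status is 0. As a rough sketch (flag_failed_jobs is a hypothetical helper, not a Grid Engine command, and it assumes the failed line precedes the exit_status line as in the output above), accounting records can be filtered on the failed field instead:

```shell
#!/bin/sh
# Hypothetical helper: read `qacct -j <id>` output on stdin and print one line
# for every record whose "failed" field is non-zero, whatever exit_status says.
flag_failed_jobs() {
    awk '
        $1 == "jobnumber"   { job = $2 }
        $1 == "failed"      { failed = $2 }
        $1 == "exit_status" {
            if (failed != 0)
                printf "job %s: failed=%s exit_status=%s\n", job, failed, $2
        }
    '
}
```

It would be used as `qacct -j 152 | flag_failed_jobs`.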
Thanks,

Ian
On Mon, 04 Mar 2013 17:36:47 -0000, Reuti <[email protected]> wrote:
On 04.03.2013 at 14:27, Ian Johnson wrote:
Dear All,
I built release 2011.11p1 of Open Grid Engine and I'm having a problem with qrsh not scheduling an interactive job on an execution host. Invoking:
$ qrsh -q all.q -verbose
local configuration broker not defined - using global configuration
Your job 152 ("QRLOGIN") has been submitted
waiting for interactive job to be scheduled ...
$ echo $?
0

And the exit status is 0!

However, the queue is left in an error state:

---------------------------------------------------------------------------------
all.q@exec_1 BIP 0/0/4 0.00 linux-x64 E
    queue all.q marked QERROR as result of job 152's failure at host exec_1
---------------------------------------------------------------------------------
Would anyone know what's going on here, or has anyone seen this behaviour before?
What created the error?
Are you now wondering about the exit code being zero, or the queue being in error state for an unknown reason? There might be something in the messages file of the qmaster or the node-specific one.
What was recorded in:

$ qacct -j 152

-- Reuti
--
Thank you,

Ian Johnson
Software Engineer

Capita Translation and Interpreting
Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): +44 845 367 7000 | Tel (US): +1 (800) 579-5010
| [email protected] | Skype ID: ian.johnson_als
www.capitatranslationinterpreting.com
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users



--
Kind regards,

Ian Johnson
Software Engineer

Capita Translation and Interpreting
Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): +44 845 367 7000 | Tel (US): +1 (800) 579-5010
| [email protected] | Skype ID: ian.johnson_als
www.capitatranslationinterpreting.com