Reuti,
Problem solved. It was a hostname lookup problem. Now that all the hosts
have correct hosts files, qrsh can connect to any slave from all submit
hosts. Thank you very much for your help over the last few days.
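For anyone who lands on this thread with the same symptom, here is a minimal sketch of a consistency check that could be run on each submit and exec host. It is not part of Grid Engine; the function name check_host and the optional resolver argument are made up for illustration, and it only uses Python's standard socket module. It assumes that a name should resolve to an address which maps back to the same short name.

```python
import socket


def check_host(name, resolver=socket):
    """Return a list of problems found when resolving `name` forward and back.

    `resolver` is a hypothetical hook (defaults to the socket module) so the
    check can be exercised without touching the real resolver.
    """
    try:
        addr = resolver.gethostbyname(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return [f"{name}: forward lookup failed ({exc})"]
    try:
        canonical, aliases, _ = resolver.gethostbyaddr(addr)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return [f"{name}: reverse lookup of {addr} failed ({exc})"]

    def short(host):
        # Compare short names only, ignoring case and any domain suffix.
        return host.split(".")[0].lower()

    known = {short(canonical)} | {short(alias) for alias in aliases}
    if short(name) not in known:
        return [f"{name}: {addr} maps back to {canonical}, not {name}"]
    return []
```

Calling check_host("broker") and check_host("exec_1") on every machine (substituting your own host names) and printing any returned messages should show which host has an incomplete or inconsistent hosts file.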
Below is the trace from the shepherd on an exec host. The hostname lookup
failure was not reported in this logfile. Could reporting it be added in a
future release? If others encounter this problem, it would immediately
identify the issue.
Many thanks,
Ian
<shepherd-trace>
03/05/2013 05:54:02 [0:6757]: shepherd called with uid = 0, euid = 0
03/05/2013 05:54:02 [0:6757]: rlogin_daemon = builtin
03/05/2013 05:54:02 [0:6757]: starting up 2011.11
03/05/2013 05:54:02 [0:6757]: setpgid(6757, 6757) returned 0
03/05/2013 05:54:02 [0:6757]: do_core_binding: "binding" parameter not found in config file
03/05/2013 05:54:02 [0:6757]: no prolog script to start
03/05/2013 05:54:02 [0:6757]: pipe to child uses fds 3 and 4
03/05/2013 05:54:02 [0:6757]: calling fork_pty()
03/05/2013 05:54:02 [0:6757]: parent: forked "job" with pid 6758
03/05/2013 05:54:02 [0:6758]: child: closing parents end of the pipe
03/05/2013 05:54:02 [0:6758]: child: trying to read from parent through the pipe
03/05/2013 05:54:02 [0:6757]: parent: job-pid: 6758
03/05/2013 05:54:02 [0:6757]: parent: closing childs end of the pipe
03/05/2013 05:54:02 [0:6757]: csp = 0
03/05/2013 05:54:02 [0:6757]: parent: starting parent loop with remote_host = broker, remote_port = 38675, job_owner = root, fd_pty_master = 5, fd_pipe_in = -1, fd_pipe_out = -1, fd_pipe_err = -1, fd_pipe_to_child = 4
03/05/2013 05:54:02 [0:6757]: parent: opening connection to qrsh/qlogin client
03/05/2013 05:54:02 [0:6757]: parent: can't open commlib stream, err_msg =
03/05/2013 05:54:02 [0:6757]: startup of qrsh job failed:
03/05/2013 05:54:02 [0:6758]: child: error communicating with parent: 0, Success
03/05/2013 05:54:02 [0:6758]: failed starting job
03/05/2013 05:54:02 [0:6758]: no epilog script to start
03/05/2013 05:54:02 [0:6758]: writing exit status to qrsh: 0
03/05/2013 05:54:02 [0:6758]: sending UNREGISTER_CTRL_MSG with exit_status = "0"
03/05/2013 05:54:02 [0:6758]: sending to host: <null>
03/05/2013 05:54:02 [0:6758]: comm_write_message returned: can't find handle
03/05/2013 05:54:02 [0:6758]: close_parent_loop: comm_write_message() returned 0 instead of 1!!!
03/05/2013 05:54:02 [0:6758]: waiting for UNREGISTER_RESPONSE_CTRL_MSG
03/05/2013 05:54:02 [0:6758]: No connection or problem while waiting for message: 1
03/05/2013 05:54:02 [0:6758]: parent: cl_com_ignore_timeouts
03/05/2013 05:54:02 [0:6758]: parent: error in comm_cleanup_lib(): 3
03/05/2013 05:54:02 [0:6758]: parent: leaving closinge_parent_loop()
</shepherd-trace>
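Since the trace does not name the lookup failure, one thing it does give is the remote_host the shepherd tried to connect back to. The following is only a rough sketch of how those lines could be pulled out of a trace; the function names join_wrapped and summarize are hypothetical, and the two failure strings are simply the ones that appear in the trace above, not a documented list of shepherd messages.

```python
import re

# Every trace entry above starts with "MM/DD/YYYY HH:MM:SS [".
TIMESTAMP = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} \[")


def join_wrapped(lines):
    """Re-join trace entries that a mail client wrapped across lines."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if TIMESTAMP.match(line) or not entries:
            entries.append(line)
        else:
            # Continuation of the previous entry.
            entries[-1] += " " + line
    return entries


def summarize(lines):
    """Return (remote_host, failure_entries) for a shepherd trace."""
    remote_host = None
    failures = []
    for entry in join_wrapped(lines):
        match = re.search(r"remote_host = ([^,\s]+)", entry)
        if match:
            remote_host = match.group(1)
        # Failure strings copied from the trace in this thread (an assumption,
        # not an exhaustive list).
        if "can't open commlib stream" in entry or "failed starting job" in entry:
            failures.append(entry)
    return remote_host, failures
```

If summarize() reports a remote_host together with a "can't open commlib stream" entry, checking whether the exec host can resolve that name would be a sensible first step.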
On Tue, 05 Mar 2013 09:39:39 -0000, Ian Johnson
<[email protected]> wrote:
Reuti,
I was wondering about both: the exit status of qrsh being 0, and the error
state being set on the queue. The output of qacct is:
$ qacct -j 152
==============================================================
qname        all.q
hostname     exec_1
group        root
owner        root
project      NONE
department   defaultdepartment
jobname      QRLOGIN
jobnumber    152
taskid       undefined
account      sge
priority     0
qsub_time    Tue Mar 5 04:35:32 2013
start_time   -/-
end_time     -/-
granted_pe   NONE
slots        1
failed       11 : before job
exit_status  0
ru_wallclock 0
ru_utime     0.000
ru_stime     0.000
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    0
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     0
ru_nivcsw    0
cpu          0.000
mem          0.000
io           0.000
iow          0.000
maxvmem      0.000
arid         undefined
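The record above shows why exit_status alone is misleading here: it is 0 while failed is 11. As a small illustration, the sketch below parses qacct -j output and treats a job as successful only when both fields are 0. The helper names parse_qacct and job_really_succeeded are hypothetical, and the parser assumes the simple "field value" layout shown above.

```python
def parse_qacct(text):
    """Parse `qacct -j` output into a dict of field name -> value string."""
    record = {}
    for line in text.splitlines():
        if line.startswith("="):
            continue  # skip the ===== separator line
        parts = line.split(None, 1)
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    return record


def job_really_succeeded(record):
    """Treat a job as successful only if `failed` and `exit_status` are both 0."""
    # `failed` may carry a description, e.g. "11 : before job".
    failed_code = record.get("failed", "0").split()[0]
    return failed_code == "0" and record.get("exit_status") == "0"
```

For job 152 this returns False, because the failed field starts with 11 even though exit_status is 0.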
However, qrsh does work from the master host, which is both a submit and an
administration host, as is the host from which I ran the "failing" qrsh
process.
Thanks,
Ian
On Mon, 04 Mar 2013 17:36:47 -0000, Reuti <[email protected]>
wrote:
On 04.03.2013 at 14:27, Ian Johnson wrote:
Dear All,
I built release 2011.11p1 of Open Grid Engine and I'm having a problem
with qrsh not scheduling an interactive job on an execution host.
Invoking:
$ qrsh -q all.q -verbose
local configuration broker not defined - using global configuration
Your job 152 ("QRLOGIN") has been submitted
waiting for interactive job to be scheduled ...
$ echo $?
0
And the exit status is 0!
However, the queue is left in an error state:
---------------------------------------------------------------------------------
all.q@exec_1 BIP 0/0/4 0.00 linux-x64 E
queue all.q marked QERROR as result of job 152's failure at host exec_1
---------------------------------------------------------------------------------
Would anyone know what's going on here, or has anyone seen this
behaviour before?
What created the error?
Are you now wondering about the exit code being zero, or about the queue
being in an error state for an unknown reason? There might be something in
the messages file of the qmaster or in the node-specific one.
What was recorded in:
$ qacct -j 152
-- Reuti
--
Thank you,
Ian Johnson
Software Engineer
Capita Translation and Interpreting
Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK):
+44 845 367 7000 | Tel (US): +1 (800) 579-5010
| [email protected] | Skype ID: ian.johnson_als
www.capitatranslationinterpreting.com
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
--
Kind regards,
Ian Johnson