Reuti,

Problem solved. It was a hostname lookup problem. Once all the hosts had correct hosts files, qrsh could connect to any slave from all submit hosts. Thank you very much for your help over the last few days.
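
For anyone hitting the same symptom, here is a minimal sketch of the kind of check that would have exposed it, assuming Python 3 is available on each host. The function name `check_host` and its injectable resolver arguments are hypothetical and not part of Grid Engine. It only tests the local resolver, so it has to be run on every submit and exec host in turn.

```python
import socket


def check_host(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return a short status string describing how `name` resolves on this host."""
    try:
        addr = forward(name)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return f"{name}: forward lookup failed ({exc})"
    try:
        canonical = reverse(addr)[0]
    except OSError as exc:  # socket.herror is a subclass of OSError
        return f"{name}: {addr} has no reverse mapping ({exc})"
    # Compare short names only, so "broker" matches "broker.example.com".
    if canonical.split(".")[0] != name.split(".")[0]:
        return f"{name}: resolves to {addr}, which maps back to {canonical}"
    return f"{name}: ok ({addr})"


# Example use (the host names are the ones that appear in this thread):
#     for host in ("broker", "exec_1"):
#         print(check_host(host))
```

A host that reports a forward lookup failure, or a name that maps back to something unexpected such as localhost, is a likely candidate for a broken hosts file.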

Below is the trace from the shepherd on an exec host. The hostname lookup failure was not reported in this logfile. Could reporting it be added in a future release? If others encounter this problem, it would immediately identify the issue.

Many thanks,

Ian

<shepherd-trace>
03/05/2013 05:54:02 [0:6757]: shepherd called with uid = 0, euid = 0
03/05/2013 05:54:02 [0:6757]: rlogin_daemon = builtin
03/05/2013 05:54:02 [0:6757]: starting up 2011.11
03/05/2013 05:54:02 [0:6757]: setpgid(6757, 6757) returned 0
03/05/2013 05:54:02 [0:6757]: do_core_binding: "binding" parameter not found in config file
03/05/2013 05:54:02 [0:6757]: no prolog script to start
03/05/2013 05:54:02 [0:6757]: pipe to child uses fds 3 and 4
03/05/2013 05:54:02 [0:6757]: calling fork_pty()
03/05/2013 05:54:02 [0:6757]: parent: forked "job" with pid 6758
03/05/2013 05:54:02 [0:6758]: child: closing parents end of the pipe
03/05/2013 05:54:02 [0:6758]: child: trying to read from parent through the pipe
03/05/2013 05:54:02 [0:6757]: parent: job-pid: 6758
03/05/2013 05:54:02 [0:6757]: parent: closing childs end of the pipe
03/05/2013 05:54:02 [0:6757]: csp = 0
03/05/2013 05:54:02 [0:6757]: parent: starting parent loop with remote_host = broker, remote_port = 38675, job_owner = root, fd_pty_master = 5, fd_pipe_in = -1, fd_pipe_out = -1, fd_pipe_err = -1, fd_pipe_to_child = 4
03/05/2013 05:54:02 [0:6757]: parent: opening connection to qrsh/qlogin client
03/05/2013 05:54:02 [0:6757]: parent: can't open commlib stream, err_msg =
03/05/2013 05:54:02 [0:6757]: startup of qrsh job failed:
03/05/2013 05:54:02 [0:6758]: child: error communicating with parent: 0, Success
03/05/2013 05:54:02 [0:6758]: failed starting job
03/05/2013 05:54:02 [0:6758]: no epilog script to start
03/05/2013 05:54:02 [0:6758]: writing exit status to qrsh: 0
03/05/2013 05:54:02 [0:6758]: sending UNREGISTER_CTRL_MSG with exit_status = "0"
03/05/2013 05:54:02 [0:6758]: sending to host: <null>
03/05/2013 05:54:02 [0:6758]: comm_write_message returned: can't find handle
03/05/2013 05:54:02 [0:6758]: close_parent_loop: comm_write_message() returned 0 instead of 1!!!
03/05/2013 05:54:02 [0:6758]: waiting for UNREGISTER_RESPONSE_CTRL_MSG
03/05/2013 05:54:02 [0:6758]: No connection or problem while waiting for message: 1
03/05/2013 05:54:02 [0:6758]: parent: cl_com_ignore_timeouts
03/05/2013 05:54:02 [0:6758]: parent: error in comm_cleanup_lib(): 3
03/05/2013 05:54:02 [0:6758]: parent: leaving closinge_parent_loop()
</shepherd-trace>
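
Although the trace never mentions name resolution, it does record which host and port the shepherd tried to connect back to (remote_host = broker, remote_port = 38675) just before "can't open commlib stream". The following is a rough sketch, not a Grid Engine tool, that pulls those two values out of a trace file so the name can then be tested on the exec host. The function names and the regular expression are my own assumptions, based only on the line format shown above.

```python
import re

# Matches the "starting parent loop" line as it appears in the trace above.
PARENT_LOOP = re.compile(
    r"remote_host = (?P<host>[^,\s]+), remote_port = (?P<port>\d+)"
)


def qrsh_client_endpoint(trace_text):
    """Return (host, port) the shepherd tried to reach, or None if not logged."""
    match = PARENT_LOOP.search(trace_text)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))


def commlib_open_failed(trace_text):
    """True if the trace contains the connection failure message seen above."""
    return "can't open commlib stream" in trace_text
```

If `commlib_open_failed` is true, checking whether the exec host can resolve the host returned by `qrsh_client_endpoint` is a reasonable first step.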

On Tue, 05 Mar 2013 09:39:39 -0000, Ian Johnson <[email protected]> wrote:

Reuti,

I was wondering about both: the exit status of qrsh being 0, and the error state being set on the queue. The output of qacct is:

$ qacct -j 152
==============================================================
qname        all.q
hostname     exec_1
group        root
owner        root
project      NONE
department   defaultdepartment
jobname      QRLOGIN
jobnumber    152
taskid       undefined
account      sge
priority     0
qsub_time    Tue Mar  5 04:35:32 2013
start_time   -/-
end_time     -/-
granted_pe   NONE
slots        1
failed       11  : before job
exit_status  0
ru_wallclock 0
ru_utime     0.000
ru_stime     0.000
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    0
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     0
ru_nivcsw    0
cpu          0.000
mem          0.000
io           0.000
iow          0.000
maxvmem      0.000
arid         undefined
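
Because qrsh itself returned 0 here, the only reliable signal is the "failed" field in the accounting record. As an illustration, this small sketch parses `qacct -j` output like the record above and treats a non-zero "failed" code as a failure even when exit_status is 0. The helper names `parse_qacct` and `job_failed` are hypothetical, and the parser assumes one "key value" pair per line for a single job record.

```python
def parse_qacct(text):
    """Parse one `qacct -j` record into a dict of field name -> raw value."""
    record = {}
    for line in text.splitlines():
        key, _, value = line.strip().partition(" ")
        value = value.strip()
        # Skip blank lines and the "=====" separator, which carry no value.
        if key and value:
            record[key] = value
    return record


def job_failed(record):
    """True if either the 'failed' code or the exit status is non-zero."""
    failed_code = record.get("failed", "0").split()[0]
    return failed_code != "0" or record.get("exit_status", "0") != "0"
```

For the record above, "failed" is "11  : before job", so `job_failed` returns True despite exit_status being 0.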

However, qrsh works from the master host, which is both a submit and an administration host, as is the host from which I ran the "failing" qrsh process.

Thanks,

Ian

On Mon, 04 Mar 2013 17:36:47 -0000, Reuti <[email protected]> wrote:

On 04.03.2013 at 14:27, Ian Johnson wrote:

Dear All,

I built release 2011.11p1 of Open Grid Engine and I'm having a problem with qrsh not scheduling an interactive job on an execution host. Invoking:

$ qrsh -q all.q -verbose
local configuration broker not defined - using global configuration
Your job 152 ("QRLOGIN") has been submitted
waiting for interactive job to be scheduled ...
$ echo $?
0

And the exit status is 0!

However, the queue is left in an error state:

---------------------------------------------------------------------------------
all.q@exec_1 BIP 0/0/4 0.00 linux-x64 E queue all.q marked QERROR as result of job 152's failure at host exec_1
---------------------------------------------------------------------------------

Would anyone know what's going on here, or has anyone seen this behaviour before?

What created the error?

Are you now wondering about the exit code being zero, or about the queue being in an error state for an unknown reason? There might be something in the messages file of the qmaster or in the node-specific one.

What was recorded in:

$ qacct -j 152

-- Reuti



--
Thank you,

Ian Johnson
Software Engineer

Capita Translation and Interpreting
Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): +44 845 367 7000 | Tel (US): +1 (800) 579-5010
| [email protected] | Skype ID: ian.johnson_als
www.capitatranslationinterpreting.com
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users




