Reuti,

To reproduce the shepherd issue:

1. Add a submit host that is *not* the master host.
2. Ensure that the submit host's hostname *cannot* be resolved by any execution host.
3. Submit a qrsh job on the new submit host.

Result:

 - The qrsh in step 3 returns an exit status of 0.
 - The queue is left in an E (error) state.
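
For step 2, a quick way to confirm which names a given execution host
cannot resolve is a small script like the one below. This is only a
sketch: the function name `unresolvable` and the injectable `resolve`
parameter are made up for illustration, and it checks the operating
system's resolver (hosts file, DNS) only, which may not match exactly
how Grid Engine itself resolves names.

```python
import socket


def unresolvable(hostnames, resolve=socket.gethostbyname):
    """Return the names in `hostnames` that `resolve` fails to look up.

    `resolve` defaults to the system resolver; it is a parameter only so
    the function can be exercised without touching the network.
    """
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError
            failed.append(name)
    return failed
```

Run on an execution host with the new submit host's name in the list,
an empty result would mean the name resolves there, so the reproduction
precondition in step 2 is not met.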

If you need any more help, please don't hesitate to get in touch.

Ian


On Wed, 06 Mar 2013 18:34:40 -0000, Dave Love <[email protected]> wrote:

Ian Johnson <[email protected]> writes:

Reuti,

Problem solved. It was a hostname lookup problem. Once all the hosts
had correct hosts files, qrsh could connect to any slave from all
submit hosts. Thank you very much for your help over the last few
days.

Below is the trace from shepherd on an exec host. The hostname lookup
failure was not reported in this logfile. Could reporting it be added
in a future release? If others encounter this problem, it would
immediately identify the issue.

If you can tell me how to reproduce it, I'll fix any such crash in SGE,
and a fix might turn up uncredited in OGS eventually. Was the hosts file
wrong at the qrsh client or server end, or both, and are you using
builtin or external remote startup?



--
Kind regards,

Ian Johnson
Software Engineer

Capita Translation and Interpreting
Riverside Court, Huddersfield Road, Delph, Oldham, OL3 5FZ | Tel (UK): +44 845 367 7000 | Tel (US): +1 (800) 579-5010
| [email protected] | Skype ID: ian.johnson_als
www.capitatranslationinterpreting.com
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
