"MacMullan, Hugh" <[email protected]> writes:

> Thanks! That's what I thought. Bah. I'll try to push this with them ... it 
> really would be nice if it was a more flexible implementation anyway. Like 
> logging ... it opens a .o and .e file for each kernel, which gets pretty ugly 
> when launching 65 kernels over and over again. :)

I don't see what they can do about communication timeouts on the system,
and I don't see how drmaa_wait can be relevant.

The error appears to come from the initial session setup, and the
drmaa_wait timeout has nothing to do with communication timeouts.  Look
at the backtrace:  cl_com_setup_commlib isn't documented in
http://arc.liv.ac.uk/SGE/adoc/libcomm.html, as it should be, but the
Java method is documented under
http://arc.liv.ac.uk/SGE/javadocs/jdrmaa/com/sun/grid/drmaa/SessionImpl.html

The communication failure needs debugging.  Do other SGE clients even
work, specifically qsub?  Is there anything useful in the qmaster
messages file?  What does tcpdump etc. show for the attempted connexion?
The basic communication tool is qping(1).
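
If qping isn't to hand on the submitting host, a plain TCP connection
test is a rough first check.  The following is only a sketch:  the
function name can_connect is made up for illustration, and the port in
the usage comment (6444, the commonly used sge_qmaster port) is an
assumption about your site, which may set a different one via
SGE_QMASTER_PORT or /etc/services.  It shows only whether the port
accepts TCP connections, not whether the commlib handshake works.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    Hypothetical helper for illustration; it is not part of SGE or DRMAA.
    """
    try:
        # create_connection resolves the host name and attempts the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example use (host name and port are assumptions, substitute your own):
#   can_connect("qmaster.example.org", 6444)
```

A False result points at the network, firewall or name resolution rather
than at DRMAA; a True result means the problem is further up, and qping
and the qmaster messages file are the next things to look at.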

If you really feel a need to change the communication timeout and it's
not documented in your sge_conf(5), look for the gdi_... qmaster_params
in http://arc.liv.ac.uk/SGE/htmlman/htmlman5/sge_conf.html and possibly
also cl_ping.  (I can't remember which daemon parameters weren't
documented previously, but the documentation there is actively
maintained.)

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/