"MacMullan, Hugh" <[email protected]> writes: > Thanks! That's what I thought. Bah. I'll try to push this with them ... it > really would be nice if it was a more flexible implementation anyway. Like > logging ... it opens a .o and .e file for each kernel, which gets pretty ugly > when launching 65 kernels over and over again. :)
I don't see what they can do about communication timeouts on the
system, and I don't see how drmaa_wait can be relevant. The error
appears to be from the initial session setup, and the drmaa_wait
timeout is unrelated to communication timeouts.

Consider the backtrace: cl_com_setup_commlib isn't documented via
http://arc.liv.ac.uk/SGE/adoc/libcomm.html, as it should be, but the
Java method is under
http://arc.liv.ac.uk/SGE/javadocs/jdrmaa/com/sun/grid/drmaa/SessionImpl.html

The communication failure needs debugging. Do other SGE clients even
work, specifically qsub? Is there anything useful in the qmaster
messages file? What does tcpdump etc. show for the attempted
connexion? The basic communication tool is qping(1).

If you really feel a need to change the communication timeout and it's
not documented in your sge_conf(5), look for the gdi_... qmaster_params
in http://arc.liv.ac.uk/SGE/htmlman/htmlman5/sge_conf.html and possibly
also cl_ping. (I can't remember which daemon parameters weren't
documented previously, but the documentation there is actively
maintained.)

-- 
Community Grid Engine: http://arc.liv.ac.uk/SGE/
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
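The connectivity checks suggested in the reply could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper names (tcp_reachable, qping_command, check_qmaster) are made up for illustration, port 6444 is assumed to be the qmaster port (check $SGE_QMASTER_PORT or /etc/services on your cluster), and the qping arguments follow the "qping -info <host> <port> qmaster 1" form from qping(1).

```python
import socket
import subprocess

# Assumption: 6444 is the conventional sge_qmaster port. Your cluster may
# use a different one; see $SGE_QMASTER_PORT or /etc/services.
DEFAULT_QMASTER_PORT = 6444


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def qping_command(host, port=DEFAULT_QMASTER_PORT):
    """Build the qping(1) call that asks the qmaster for status information."""
    return ["qping", "-info", host, str(port), "qmaster", "1"]


def check_qmaster(host, port=DEFAULT_QMASTER_PORT, timeout=30):
    """Try a raw TCP connect first, then qping.

    Returns (tcp_ok, qping_output), where qping_output is None if the
    TCP connect already failed.
    """
    if not tcp_reachable(host, port):
        return False, None
    try:
        result = subprocess.run(
            qping_command(host, port),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        # qping is not on PATH, or it hung longer than the timeout.
        return True, "qping did not complete: {}".format(exc)
    return True, result.stdout + result.stderr
```

Calling check_qmaster("qmaster.example.org") (a placeholder host name) from the machine where the DRMAA session fails separates a network-level problem (the TCP connect fails) from a commlib-level one (TCP works but qping does not answer).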
