Hi,

For a low-level Grid Engine test, you can use the 'qping' command to check communication between the various Grid Engine daemons.

I would guess that qping would also fail during the periods when you see the commlib error.

E.g. from an exec host:

qping -info <QMASTER_NAME> 6444 qmaster 1

See the qping man page for more info. That should help you see if there's an intermittent network problem of some kind.
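
If the problem is intermittent, a single qping may well succeed, so it can help to run the same check repeatedly and log the result with a timestamp. Below is a minimal sketch of that idea, not a tested tool. The function names, the QPING_BIN variable and the 30-second default interval are my own choices for illustration. It also assumes that "qping -info" exits with 0 on success and non-zero on error, which is how I read the man page; please verify that on your version.

```shell
#!/bin/sh
# Sketch only: probe the qmaster with qping and log timestamped results so
# that failures can be lined up against the commlib errors in the job output.
# probe_qmaster, monitor_qmaster and QPING_BIN are illustrative names.

# probe_qmaster HOST [PORT]
# Runs one "qping -info" query, prints a timestamped OK/FAIL line and
# returns 0 on success, 1 on failure.
# Assumption: qping -info exits 0 on success and non-zero on error.
probe_qmaster() {
    host="$1"
    port="${2:-6444}"
    ts=$(date '+%Y-%m-%d %H:%M:%S')
    if "${QPING_BIN:-qping}" -info "$host" "$port" qmaster 1 >/dev/null 2>&1; then
        echo "$ts OK $host:$port"
    else
        echo "$ts FAIL $host:$port"
        return 1
    fi
}

# monitor_qmaster HOST COUNT [INTERVAL]
# Probes COUNT times, sleeping INTERVAL seconds (default 30) between probes.
monitor_qmaster() {
    host="$1"
    count="$2"
    interval="${3:-30}"
    i=0
    while [ "$i" -lt "$count" ]; do
        probe_qmaster "$host" || true
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
    return 0
}
```

For example, after sourcing the file on an exec host, "monitor_qmaster <QMASTER_NAME> 1000 30 >> /tmp/qping.log" would cover a bit more than eight hours. Any FAIL lines whose timestamps match the commlib errors would point towards the network rather than the jobs themselves.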

Regards,
Alex

On 09/05/2012 01:07 PM, Brodie, Kent wrote:
Hi-- we’re using Sun Grid Engine for our Illumina jobs, and are having
a bear of a time getting things to finish without blowing up. Almost
every job submission, we end up seeing errors like this after several
hours. I really can find nothing else in the SGE logs to tell me
what’s going on.

We have a cluster of Dell R610’s with a dedicated qmaster node.
Connections to shared data are all via 10-gig Isilon. Spool
directories (classic) are local to each node.

00:02:39]   [cairo]
[6cyc_5pm_NoIndex_L006_R1_008_eland_extended.txt.oa]    error: commlib
error: got read error (closing "cairo/shepherd_ijs/1")

How can I go from this kind of message (commlib error) to something
that’s more meaningful?

Thanks for ANY insight with this!  --Kent

-- 347-401-4860
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
