Hi,
For a low-level Grid Engine connectivity test, you can use the 'qping'
command to check communication between the various Grid Engine daemons.
My guess is that while the commlib error is occurring, qping would
fail as well.
E.g. from an exec host:
    qping -info <QMASTER_NAME> 6444 qmaster 1
See the qping man page for more info. That should help you see if
there's an intermittent network problem of some kind.
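Since your errors only show up after several hours, a single qping run
may not catch the problem. Below is a rough sketch of a wrapper that
repeats a probe command and timestamps each failure, so you can line the
failures up with the commlib errors. The function name probe_loop, the
count and the interval are made up for illustration, and it assumes
qping exits with a non-zero status when it cannot reach the daemon
(check the man page to confirm that on your version).

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): run a probe command
# repeatedly and print a timestamped line for every failed attempt.
# Usage: probe_loop COUNT INTERVAL_SECONDS COMMAND [ARGS...]
probe_loop() {
    count=$1
    interval=$2
    shift 2
    failures=0
    i=0
    while [ "$i" -lt "$count" ]; do
        # Assumption: the probe command signals failure via its exit status.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
            echo "$(date '+%Y-%m-%d %H:%M:%S') probe failed: $*"
        fi
        i=$((i + 1))
        # Sleep between attempts, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
    echo "failures: $failures of $count"
}

# Example with illustrative values (720 probes, 5 seconds apart);
# replace the placeholder with your qmaster host name:
# probe_loop 720 5 qping -info <QMASTER_NAME> 6444 qmaster 1
```

Run it from an exec host while a job is active and redirect the output
to a file; any failure timestamps that coincide with the commlib errors
would point towards the network rather than the job itself.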
Regards,
Alex
On 09/05/2012 01:07 PM, Brodie, Kent wrote:
Hi-- we’re using Sun Grid Engine for our Illumina jobs, and are having
a bear of a time getting things to finish without blowing up. Almost
every job submission, we end up seeing errors like this after several
hours. I really can find nothing else in the SGE logs to tell me
what’s going on.
We have a cluster of Dell R610s with a dedicated qmaster node.
Connections to shared data are all via 10-gig Isilon. Spool
directories (classic) are local to each node.
00:02:39] [cairo]
[6cyc_5pm_NoIndex_L006_R1_008_eland_extended.txt.oa] error: commlib
error: got read error (closing "cairo/shepherd_ijs/1")
How can I go from this kind of message (commlib error) to something
that’s more meaningful?
Thanks for ANY insight with this! --Kent