"Brodie, Kent" <[email protected]> writes:

> Hi-- we're using Sun Grid Engine for our Illumina jobs, and are having
> a bear of a time getting things to finish without blowing up.  Almost
> every job submission, we end up seeing errors like this after several
> hours.  I really can find nothing else in the SGE logs to tell me
> what's going on.

What about syslog?  If it's Linux-based, you might see kernel reports of
SEGVs or of the OOM killer terminating processes.
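
For what it's worth, here is a rough sketch of that check.  The function
name is made up, and the patterns are only my assumption of what typical
Linux kernels print for OOM kills and segfaults, so adjust them to
whatever your nodes actually log.

```shell
# Hypothetical helper: filter kernel-log text on stdin for lines that
# look like OOM-killer or segfault reports.  The pattern list is an
# assumption based on common Linux kernel messages, not an exhaustive set.
scan_kernel_log() {
    grep -iE 'out of memory|oom-killer|oom_kill|killed process|segfault'
}

# Possible use on an execution host (may need root):
#   dmesg | scan_kernel_log
#   scan_kernel_log < /var/log/messages
```

The syslog file name varies by distribution, so /var/log/messages is
only an example.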

> We have a cluster of Dell R610s with a dedicated qmaster node.
> Connections to shared data are all via 10-gig Isilon.  Spool directories
> (classic) are local to each node.
>
> 00:02:39]   [cairo] [6cyc_5pm_NoIndex_L006_R1_008_eland_extended.txt.oa]    
> error: commlib error: got read error (closing "cairo/shepherd_ijs/1")

Where's that message from -- qrsh?  I don't remember what you might
expect to see where.

> How can I go from this kind of message (commlib error) to something
> that's more meaningful?

First of all, turn the log level up to "info" (loglevel in sge_conf(5)),
though I doubt that will help much here.  I guess the shepherd is dying.
It might help to preserve the active_jobs directory and look at the
shepherd messages (KEEP_ACTIVE under execd_params in sge_conf(5)).  On
GNU/Linux, you could try to get core dumps by using the libcore hack,
e.g. the source from
<https://arc.liv.ac.uk/trac/SGE/browser/sge/source/libs/libcore/libcore.c?rev=3578>.
That might at least show that something was crashing.
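
As a sketch of the first two suggestions: the settings in the comments
are the ones documented in sge_conf(5), while the helper function and
the file names it looks for (trace, error, exit_status) are my
assumption of what the shepherd leaves in a job's active_jobs
directory, so check what is actually there on your nodes.

```shell
# Configuration, edited with "qconf -mconf" (global) or
# "qconf -mconf <hostname>" (one execution host):
#   loglevel      log_info
#   execd_params  KEEP_ACTIVE=true

# Hypothetical helper: print the shepherd's files from one preserved
# job directory, i.e. <execd spool dir>/<host>/active_jobs/<job>.<task>.
# The three file names are assumed; missing files are skipped silently.
show_shepherd_files() {
    job_dir=$1
    for f in trace error exit_status; do
        if [ -f "$job_dir/$f" ]; then
            printf '== %s ==\n' "$f"
            cat "$job_dir/$f"
        fi
    done
}

# Possible use; the path is a placeholder, not a real location:
#   show_shepherd_files /path/to/spool/cairo/active_jobs/<job>.<task>
```

KEEP_ACTIVE is meant for debugging only, since the job directories then
accumulate, so turn it off again afterwards.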

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/
