"Brodie, Kent" <[email protected]> writes:

> Hi-- we're using Sun Grid Engine for our Illumina jobs, and are having
> a bear of a time getting things to finish without blowing up. Almost
> every job submission, we end up seeing errors like this after several
> hours. I really can find nothing else in the SGE logs to tell me
> what's going on.

What about syslog? If it's Linux-based, you might see kernel reports of
SEGVs or OOM kills.

> We have a cluster of Dell R610's with a dedicated qmaster node.
> Connections to shared data are all via 10-gig Isilon. Spool directories
> (classic) are local to each node.
>
> 00:02:39] [cairo] [6cyc_5pm_NoIndex_L006_R1_008_eland_extended.txt.oa]
> error: commlib error: got read error (closing "cairo/shepherd_ijs/1")

Where's that message from -- qrsh? I don't remember what you might
expect to see where.

> How can I go from this kind of message (commlib error) to something
> that's more meaningful?

First of all, turn the log level up to "info", though I doubt that will
help much here. I guess the shepherd is dying. It might help to
preserve the active_jobs directory and look at the shepherd messages
(KEEP_ACTIVE in execd_params; see sge_conf(5)). On GNU/Linux, you could
try to get core dumps by using the libcore hack, e.g. the source from
<https://arc.liv.ac.uk/trac/SGE/browser/sge/source/libs/libcore/libcore.c?rev=3578>.
That might at least show that something was crashing.

-- 
Community Grid Engine: http://arc.liv.ac.uk/SGE/
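
As a rough illustration of the syslog check suggested above, the Python
sketch below scans a log file on an execution host for lines that look
like kernel segfault or out-of-memory reports. The function name
scan_log, the match patterns, and the default log path are all
assumptions made for this example; the kernel's wording varies between
versions, so adjust the patterns to whatever your hosts actually log.

```python
import re
import sys

# Substrings the Linux kernel typically logs for these events. The exact
# wording varies by kernel version, so these patterns are assumptions.
PATTERNS = {
    "segfault": re.compile(r"segfault at", re.IGNORECASE),
    "oom": re.compile(r"out of memory|oom-killer", re.IGNORECASE),
}


def scan_log(lines):
    """Return (kind, line) pairs for lines that look like crash or OOM reports."""
    hits = []
    for line in lines:
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((kind, line.rstrip("\n")))
                break  # report each line once, under the first matching kind
    return hits


if __name__ == "__main__":
    # The log path is an assumption: often /var/log/messages on Red Hat-style
    # systems and /var/log/syslog on Debian-style ones.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    with open(path, errors="replace") as log:
        for kind, line in scan_log(log):
            print(f"{kind}: {line}")
```

Run it on the node where the job died, passing the log path as the
first argument, and compare the timestamps of any hits with the time of
the commlib error.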
