(My apologies, I am still new at working out failures over here.)
This relates to my earlier fight with Illumina jobs on my cluster.
With 4 or 5 processes per node, it worked fine. When I cranked it up higher, I got this:
"error: executing task of job 22130 failed: failed sending task to execd@rome:
can't find connection"
OK, so I now know I have some kind of communications issue. Beyond the
error above, I find nothing else on the qmaster host (in the 'messages' file
in the qmaster spool).
When I looked at the exec host, I saw these:
09/21/2012 06:18:54| main|rome|W|reaping job "22130" ptf complains: Job does not exist
09/21/2012 06:18:54| main|rome|W|reaping job "22130" ptf complains: Job does not exist
09/21/2012 06:18:54| main|rome|W|reaping job "22130" ptf complains: Job does not exist
09/21/2012 06:18:54| main|rome|W|reaping job "22130" ptf complains: Job does not exist
09/21/2012 06:18:54| main|rome|W|reaping job "22130" ptf complains: Job does not exist
09/21/2012 06:18:54| main|rome|W|reaping job "22130" ptf complains: Job does not exist
09/21/2012 06:18:55| main|rome|W|reaping job "22130" ptf complains: Job does not exist
09/21/2012 06:18:55| main|rome|W|reaping job "22130" ptf complains: Job does not exist
09/21/2012 06:18:55| main|rome|W|reaping job "22130" ptf complains: Job does not exist
09/21/2012 06:18:55| main|rome|W|reaping job "22130" ptf complains: Job does not exist
...but then I discovered that I got these on EVERY exec host, not just the
one mentioned in the error. I am guessing that's because the job died?
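
To compare the exec hosts, the warnings can be tallied per host and job. The following is only a minimal sketch: it assumes each entry in the execd 'messages' file is a single physical line with the pipe-separated layout shown above (timestamp, thread, host, level, message). The function name count_ptf_warnings and the sample list are hypothetical, made up for illustration.

```python
from collections import Counter


def count_ptf_warnings(lines):
    """Count 'ptf complains' warnings per (host, job id) pair."""
    counts = Counter()
    for line in lines:
        # Assumed layout: timestamp|thread|host|level|message
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5:
            continue  # skip anything that does not match the layout
        _timestamp, _thread, host, level, message = (f.strip() for f in fields)
        if level != "W" or "ptf complains" not in message:
            continue
        # The job id is the first double-quoted token in the message
        parts = message.split('"')
        if len(parts) >= 3:
            counts[(host, parts[1])] += 1
    return counts


# Two entries copied from the log above, rejoined onto single lines
sample = [
    '09/21/2012 06:18:54| main|rome|W|reaping job "22130" ptf complains: Job does not exist',
    '09/21/2012 06:18:55| main|rome|W|reaping job "22130" ptf complains: Job does not exist',
]
print(count_ptf_warnings(sample))
```

Feeding it the lines of each host's 'messages' file (for example via open(path)) would show whether every host logged a similar number of warnings for the job.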
Here's the weird part: qacct on the job indicates an exit status of 0.
[root@lima ~]# qacct -j 22130
==============================================================
qname        casava
hostname     sofia
group        kcb
owner        kcb
project      NONE
department   defaultdepartment
jobname      AlignJob
jobnumber    22130
taskid       undefined
account      sge
priority     0
qsub_time    Thu Sep 20 19:26:28 2012
start_time   Thu Sep 20 19:26:37 2012
end_time     Fri Sep 21 09:26:40 2012
granted_pe   make
slots        56
failed       0
exit_status  0
ru_wallclock 50403
ru_utime     2658384.732
ru_stime     16415.789
ru_maxrss    488183180
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    1073239146
ru_majflt    57643
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     51856043
ru_nivcsw    257585084
cpu          2674800.521
mem          1532036.420
io           11164.739
iow          0.000
maxvmem      694.979G
arid         undefined
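
One sanity check on these numbers is how busy the granted slots were: cpu divided by (ru_wallclock * slots). The sketch below is an assumption-laden illustration, not a Grid Engine tool: it assumes this single accounting record covers all 56 slots, and parse_qacct and cpu_efficiency are hypothetical helper names. With the values above the ratio comes out at roughly 0.95.

```python
def parse_qacct(text):
    """Turn 'key value' lines from qacct -j output into a dict of strings."""
    record = {}
    for line in text.splitlines():
        if line.startswith("="):
            continue  # separator line between records
        parts = line.split(None, 1)
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    return record


def cpu_efficiency(record):
    """CPU seconds divided by (wallclock seconds * slots)."""
    wall = float(record["ru_wallclock"])
    slots = int(record["slots"])
    return float(record["cpu"]) / (wall * slots)


# A few fields copied from the qacct output above
sample = """\
slots 56
failed 0
exit_status 0
ru_wallclock 50403
cpu 2674800.521
"""
rec = parse_qacct(sample)
print(rec["failed"], rec["exit_status"], round(cpu_efficiency(rec), 3))
```

A ratio this close to 1 would suggest most slots were doing work for nearly the whole run, assuming the record really does aggregate every slot.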
What else do I look at to see what's going on?