(My apologies; I am still new to troubleshooting failures here.)

This relates to my earlier fight with Illumina jobs on my cluster.
With 4 or 5 processes per node, everything worked fine. When I cranked it up
higher, I got this:


"error: executing task of job 22130 failed: failed sending task to execd@rome: can't find connection"


OK, so I now know I have some kind of communications issue. Beyond the
error above, I find nothing else on the qmaster host (in the 'messages' file
in the qmaster spool).

When I looked at the exec host, I saw these:

09/21/2012 06:18:54|  main|rome|W|reaping job "22130" ptf complains: Job does not exist
09/21/2012 06:18:54|  main|rome|W|reaping job "22130" ptf complains: Job does not exist
09/21/2012 06:18:54|  main|rome|W|reaping job "22130" ptf complains: Job does not exist
09/21/2012 06:18:54|  main|rome|W|reaping job "22130" ptf complains: Job does not exist
09/21/2012 06:18:54|  main|rome|W|reaping job "22130" ptf complains: Job does not exist
09/21/2012 06:18:54|  main|rome|W|reaping job "22130" ptf complains: Job does not exist
09/21/2012 06:18:55|  main|rome|W|reaping job "22130" ptf complains: Job does not exist
09/21/2012 06:18:55|  main|rome|W|reaping job "22130" ptf complains: Job does not exist
09/21/2012 06:18:55|  main|rome|W|reaping job "22130" ptf complains: Job does not exist
09/21/2012 06:18:55|  main|rome|W|reaping job "22130" ptf complains: Job does not exist

... but then I discovered that I got these on EVERY exec host, not just the
one mentioned in the error. I am guessing that's because the job died?
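
To compare hosts, the warnings could be tallied with something like the
Python sketch below. It assumes the pipe-delimited layout seen in the lines
above (timestamp, thread, host, level, message) and treats any line without
that layout as a wrapped continuation of the previous entry. The function
name tally_warnings is made up for illustration, and the log text is passed
in as a string because spool locations differ between installs.

```python
from collections import Counter


def tally_warnings(log_text):
    """Count W-level entries per (host, message) in execd 'messages' text."""
    entries = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Assumption: a real entry has at least five pipe-separated fields;
        # anything else is the wrapped tail of the previous entry.
        if line.count("|") >= 4:
            entries.append(line)
        elif entries:
            entries[-1] += " " + line

    counts = Counter()
    for entry in entries:
        _timestamp, _thread, host, level, message = [
            field.strip() for field in entry.split("|", 4)
        ]
        if level == "W":
            counts[(host, message)] += 1
    return counts
```

Feeding it the messages text from each exec host and printing
counts.most_common() would show whether every host reports the same warning
for the same job.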

Here's the weird part: qacct on the job indicates an exit status of 0.

[root@lima ~]# qacct -j 22130
==============================================================
qname        casava
hostname     sofia
group        kcb
owner        kcb
project      NONE
department   defaultdepartment
jobname      AlignJob
jobnumber    22130
taskid       undefined
account      sge
priority     0
qsub_time    Thu Sep 20 19:26:28 2012
start_time   Thu Sep 20 19:26:37 2012
end_time     Fri Sep 21 09:26:40 2012
granted_pe   make
slots        56
failed       0
exit_status  0
ru_wallclock 50403
ru_utime     2658384.732
ru_stime     16415.789
ru_maxrss    488183180
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    1073239146
ru_majflt    57643
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     51856043
ru_nivcsw    257585084
cpu          2674800.521
mem          1532036.420
io           11164.739
iow          0.000
maxvmem      694.979G
arid         undefined
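
As a rough sanity check on that record, the accounted CPU time can be
divided by the wallclock time to see how many cores were busy on average.
This is only a sketch: parse_qacct and average_busy_cores are hypothetical
helper names, and it assumes the "key value" layout of the qacct output
above, with cpu and ru_wallclock both in seconds.

```python
def parse_qacct(text):
    """Turn 'key   value' lines from qacct -j output into a dict of strings."""
    record = {}
    for line in text.splitlines():
        if line.startswith("="):
            continue  # skip the ===== separator line
        parts = line.split(None, 1)
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    return record


def average_busy_cores(record):
    """CPU seconds divided by wallclock seconds (assumes both are present)."""
    return float(record["cpu"]) / float(record["ru_wallclock"])
```

With the values above (cpu 2674800.521, ru_wallclock 50403) the ratio comes
out at roughly 53 against 56 granted slots, which would suggest the job was
busy for nearly the whole run rather than dying early.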

What else should I look at to see what's going on?



