Hi. We've now shot the head node in the head (heh) and we're exploring killing off/restarting each execd on the compute nodes.
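For concreteness, the shutdown options we're weighing look something like this (host name is just an example, and this assumes a stock SGE install with the usual init script):

```shell
# Ask the qmaster to shut down the execd on one node, leaving running
# jobs (and their sge_shepherd processes) alone:
qconf -ke compute-0-6

# More aggressive: shut down the execd AND kill its active jobs:
qconf -kej compute-0-6

# Or locally on the node itself (init script name/path varies by install):
/etc/init.d/sgeexecd stop
/etc/init.d/sgeexecd start
```

As we understand it, jobs should survive a plain execd stop/start because the shepherd processes keep running and the restarted execd picks them back up; it's -kej that actually reaps jobs.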
Do you recommend a kill -HUP on the sge_execd process, or something more aggressive? We're assuming this will, in theory, kill any currently executing jobs on that compute host?

Also, we just caught another one in the act, on one of the nodes that just threw the 137:

[root@compute-0-6 ~]# tail -f /opt/gridengine/default/spool/compute-0-6/messages
01/17/2013 08:03:15| main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
01/17/2013 09:22:33| main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
01/17/2013 09:24:55| main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
01/17/2013 09:34:12| main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
01/17/2013 10:06:45| main|compute-0-6|E|removing unreferenced job 1371379.7545 without job report from ptf
01/17/2013 10:09:25| main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
01/18/2013 17:10:52| main|compute-0-6|W|can't register at qmaster "cluster.local": abort qmaster registration due to communication errors
01/18/2013 17:16:42| main|compute-0-6|W|gethostbyname(cluster.local) took 20 seconds and returns TRY_AGAIN
01/18/2013 17:25:37| main|compute-0-6|E|commlib error: got select error (No route to host)

What's most unusual about this is that these timestamps don't match up with the error 137 we just saw.
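As a sanity check on the 137 itself: exit statuses above 128 conventionally mean the process was killed by signal (status - 128), so 137 corresponds to signal 9 (SIGKILL), which often points at the kernel OOM killer or an explicit kill -9. A quick sketch of that arithmetic (nothing SGE-specific here):

```shell
# Decode an exit_status like the one qacct reports into the terminating
# signal, using the usual 128+N convention for signal deaths.
status=137
if [ "$status" -gt 128 ]; then
  sig=$((status - 128))
  echo "exit_status $status => killed by signal $sig ($(kill -l "$sig"))"
else
  echo "exit_status $status => normal exit"
fi
```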
This example job was running for days, then just became unhappy today, then threw the 137:

Job 1307803 (b5_set11_9) Complete
 User             = someguy
 Queue            = [email protected]
 Host             = compute-0-6.local
 Start Time       = 01/14/2013 14:22:12
 End Time         = 01/21/2013 12:23:02
 User Time        = 6:21:24:07
 System Time      = 00:00:27
 Wallclock Time   = 6:22:00:50
 CPU              = 6:21:24:35
 Max vmem         = 13.302G
 Exit Status      = 137

[someguy@cluster run]$ qacct -j 1307803
==============================================================
qname        medium.q
hostname     compute-0-6.local
group        users
owner        uqgmoser
project      NONE
department   defaultdepartment
jobname      b5_set11_9
jobnumber    1307803
taskid       undefined
account      sge
priority     0
qsub_time    Mon Jan 14 14:22:04 2013
start_time   Mon Jan 14 14:22:12 2013
end_time     Mon Jan 21 12:23:02 2013
granted_pe   NONE
slots        1
failed       0
exit_status  137
ru_wallclock 597650
ru_utime     595447.475
ru_stime     27.902
ru_maxrss    13814492
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    46105
ru_majflt    33
ru_nswap     0
ru_inblock   19736
ru_oublock   160
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     786
ru_nivcsw    1520367
cpu          595475.377
mem          3938342.810
io           0.430
iow          0.000
maxvmem      13.302G
arid         undefined

Thoughts, at this point? We're really running out of ideas now [apart from the most recent suggestion of restarting the execd and qmaster].

--JC

On 19/01/13 2:16 AM, "Dave Love" <[email protected]> wrote:

>Jake Carroll <[email protected]> writes:
>
>> Hi.
>>
>> Interesting.
>>
>> From /opt/gridengine/default/spool/compute-0-4/messages, we are seeing
>> some unusual stuff (or, maybe it is entirely run of the mill?):
>
>I'm not sure whether it's the same as
>https://arc.liv.ac.uk/trac/SGE/ticket/1418, which I haven't tried to
>debug. It might be relevant what operating system it is.
>
>> 01/16/2013 18:05:50| main|compute-0-4|W|reaping job "1350379" ptf
>> complains: Job does not exist
>> 01/16/2013 18:07:56| main|compute-0-4|E|removing unreferenced job
>> 1350379.4111 without job report from ptf
>
>> At this point, we're scratching our heads and considering a reboot of
>> the head node on Friday, as we really aren't understanding what is
>> going wrong here.
>
>I'd restart the execd on the node, if anything, and possibly the
>qmaster. I can't think rebooting the head would be useful.
>
>--
>Community Grid Engine: http://arc.liv.ac.uk/SGE/

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
