Hi. We've now shot the head node in the head (heh) and we're exploring killing off/restarting each execd on the compute nodes.
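For concreteness, the shutdown options we're weighing look something like this (host name is just an example, and this assumes a stock SGE install with the usual init script):

```shell
# Ask the qmaster to shut down the execd on one node, leaving running
# jobs (and their sge_shepherd processes) alone:
qconf -ke compute-0-6

# More aggressive: shut down the execd AND kill its active jobs:
qconf -kej compute-0-6

# Or locally on the node itself (init script name/path varies by install):
/etc/init.d/sgeexecd stop
/etc/init.d/sgeexecd start
```

As we understand it, jobs should survive a plain execd stop/start because the shepherd processes keep running and the restarted execd picks them back up; it's -kej that actually reaps jobs.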
Do you recommend a kill -HUP on the sge_execd process, or something more aggressive? We're assuming this will, in theory, kill any currently executing jobs on that compute host?

Also, we just caught another one in the act, on one of the nodes that just threw the 137:

[root@compute-0-6 ~]# tail -f /opt/gridengine/default/spool/compute-0-6/messages
01/17/2013 08:03:15| main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
01/17/2013 09:22:33| main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
01/17/2013 09:24:55| main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
01/17/2013 09:34:12| main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
01/17/2013 10:06:45| main|compute-0-6|E|removing unreferenced job 1371379.7545 without job report from ptf
01/17/2013 10:09:25| main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
01/18/2013 17:10:52| main|compute-0-6|W|can't register at qmaster "cluster.local": abort qmaster registration due to communication errors
01/18/2013 17:16:42| main|compute-0-6|W|gethostbyname(cluster.local) took 20 seconds and returns TRY_AGAIN
01/18/2013 17:25:37| main|compute-0-6|E|commlib error: got select error (No route to host)

What's most unusual about this is that these timestamps don't match up with the error 137 we just saw.
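As a sanity check on the 137 itself: exit statuses above 128 conventionally mean the process was killed by signal (status - 128), so 137 corresponds to signal 9 (SIGKILL), which often points at the kernel OOM killer or an explicit kill -9. A quick sketch of that arithmetic (nothing SGE-specific here):

```shell
# Decode an exit_status like the one qacct reports into the terminating
# signal, using the usual 128+N convention for signal deaths.
status=137
if [ "$status" -gt 128 ]; then
  sig=$((status - 128))
  echo "exit_status $status => killed by signal $sig ($(kill -l "$sig"))"
else
  echo "exit_status $status => normal exit"
fi
```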
This example job was running for days, then just became unhappy today, then threw the 137:

Job 1307803 (b5_set11_9) Complete
 User             = someguy
 Queue            = [email protected]
 Host             = compute-0-6.local
 Start Time       = 01/14/2013 14:22:12
 End Time         = 01/21/2013 12:23:02
 User Time        = 6:21:24:07
 System Time      = 00:00:27
 Wallclock Time   = 6:22:00:50
 CPU              = 6:21:24:35
 Max vmem         = 13.302G
 Exit Status      = 137

[someguy@cluster run]$ qacct -j 1307803
==============================================================
qname        medium.q
hostname     compute-0-6.local
group        users
owner        uqgmoser
project      NONE
department   defaultdepartment
jobname      b5_set11_9
jobnumber    1307803
taskid       undefined
account      sge
priority     0
qsub_time    Mon Jan 14 14:22:04 2013
start_time   Mon Jan 14 14:22:12 2013
end_time     Mon Jan 21 12:23:02 2013
granted_pe   NONE
slots        1
failed       0
exit_status  137
ru_wallclock 597650
ru_utime     595447.475
ru_stime     27.902
ru_maxrss    13814492
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    46105
ru_majflt    33
ru_nswap     0
ru_inblock   19736
ru_oublock   160
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     786
ru_nivcsw    1520367
cpu          595475.377
mem          3938342.810
io           0.430
iow          0.000
maxvmem      13.302G
arid         undefined

Thoughts, at this point? We're really running out of ideas now [apart from the most recent suggestion of restarting the execd and qmaster].

--JC

On 19/01/13 2:16 AM, "Dave Love" <[email protected]> wrote:

>Jake Carroll <[email protected]> writes:
>
>> Hi.
>>
>> Interesting.
>>
>> From /opt/gridengine/default/spool/compute-0-4/messages, we are seeing
>> some unusual stuff (or, maybe it is entirely run of the mill?):
>
>I'm not sure whether it's the same as
>https://arc.liv.ac.uk/trac/SGE/ticket/1418, which I haven't tried to
>debug. It might be relevant what operating system it is.
>
>> 01/16/2013 18:05:50| main|compute-0-4|W|reaping job "1350379" ptf
>> complains: Job does not exist
>> 01/16/2013 18:07:56| main|compute-0-4|E|removing unreferenced job
>> 1350379.4111 without job report from ptf
>
>> At this point, we're scratching our heads and considering a reboot of
>> the head node on Friday, as we really aren't understanding what is
>> going wrong here.
>
>I'd restart the execd on the node, if anything, and possibly the
>qmaster. I can't think rebooting the head would be useful.
>
>--
>Community Grid Engine: http://arc.liv.ac.uk/SGE/

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
