On 30.10.2012 at 23:45, Joseph Farran wrote:

> No:
>
> # qconf -sq free2 | fgrep terminate
> terminate_method      NONE
Is the process still doing something sensible, or is it hanging somewhere in a loop?

$ strace -p 1234

where 1234 is the PID of the process on the node (you have to be root or the owner of the process). Afterwards: is a `kill -9 1234` by hand successful?

-- Reuti

> On 10/30/2012 03:07 PM, Reuti wrote:
>> Mmh, was the terminate method redefined in the queue configuration of the
>> queue in question?
>>
>>
>> On 30.10.2012 at 23:04, Joseph Farran wrote:
>>
>>> No, still no cigar.
>>>
>>> # cat /var/spool/ge/compute-12-22/messages | grep wall
>>> #
>>>
>>> Here is what is strange.
>>>
>>> Some jobs do get killed just fine. One job that just went over the time
>>> limit on another queue, GE killed it, and here is the log:
>>>
>>> 10/30/2012 14:32:06|  main|compute-1-7|I|registered at qmaster host "hpc.local"
>>> 10/30/2012 14:32:06|  main|compute-1-7|I|Reconnected to qmaster - enabled delayed job reporting period
>>> 10/30/2012 14:42:04|  main|compute-1-7|I|Delayed job reporting period finished
>>> 10/30/2012 14:57:35|  main|compute-1-7|W|job 12730.1 exceeded hard wallclock time - initiate terminate method
>>> 10/30/2012 14:57:36|  main|compute-1-7|I|SIGNAL jid: 12730 jatask: 1 signal: KILL
>>>
>>>
>>> On 10/30/2012 03:00 PM, Reuti wrote:
>>>> Sorry, should be like:
>>>>
>>>> 10/30/2012 22:59:50|  main|pc15370|W|job 5281.1 exceeded hard wallclock time - initiate terminate method
>>>>
>>>>
>>>> On 30.10.2012 at 22:57, Joseph Farran wrote:
>>>>
>>>>> Did not have loglevel set to log_info, so I updated it, restarted GE on
>>>>> the master, and did a softstop and start on the compute node.
>>>>>
>>>>> I got a lot more log information now, but still no cigar:
>>>>>
>>>>> # cat /var/spool/ge/compute-12-22/messages | fgrep h_rt
>>>>> #
>>>>>
>>>>> Checked a few other compute nodes as well for "h_rt" and found nothing
>>>>> either.
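[Editor's note: Reuti's manual test above (attach strace, then try a `kill -9` by hand) can be scripted. A minimal Python sketch, not part of Grid Engine, that checks whether a PID still exists after a SIGKILL, using the standard signal-0 probe:]

```python
import os
import signal
import subprocess

def pid_alive(pid):
    """Probe a PID with signal 0: nothing is delivered, but the kernel
    still checks that the process exists and that we may signal it."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True   # process exists but belongs to another user

# Demo with a throwaway child instead of a real stray job:
child = subprocess.Popen(["sleep", "300"])
assert pid_alive(child.pid)

os.kill(child.pid, signal.SIGKILL)   # the manual `kill -9 1234`
child.wait()                         # reap it, so the PID really disappears
print("still alive:", pid_alive(child.pid))
```

[If `pid_alive` keeps returning True even after a SIGKILL, the process is most likely stuck in uninterruptible sleep (state D in `ps`), which not even root can kill; that would point at a kernel/IO problem rather than at Grid Engine.]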
>>>>>
>>>>> On 10/30/2012 01:49 PM, Reuti wrote:
>>>>>> On 30.10.2012 at 20:18, Joseph Farran wrote:
>>>>>>
>>>>>>> Here is one case:
>>>>>>>
>>>>>>> qstat | egrep "12959|12960"
>>>>>>> 12959 0.50500 dna.pmf_17 amentes  r  10/24/2012 18:59:12  [email protected]  1
>>>>>>> 12960 0.50500 dna.pmf_17 amentes  r  10/24/2012 18:59:12  [email protected]  1
>>>>>>>
>>>>>>> On compute-12-22:
>>>>>>>
>>>>>>> compute-12-22 ~]# ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>>>>   0 570   0 201 Sl  /data/hpc/ge/bin/lx-amd64/sge_execd
>>>>>>>   0   0   0   0 S    \_ /bin/bash /data/hpc/ge/load-sensor-cores-in-use.sh
>>>>>>>   0 570   0 201 S    \_ sge_shepherd-12959 -bg
>>>>>>> 993 993 115 115 Ss  |   \_ -bash /var/spool/ge/compute-12-22/job_scripts/12959
>>>>>>> 993 993 115 115 Rs  |       \_ ./pcharmm32
>>>>>>>   0 570   0 201 S    \_ sge_shepherd-12960 -bg
>>>>>>> 993 993 115 115 Ss      \_ -bash /var/spool/ge/compute-12-22/job_scripts/12960
>>>>>>> 993 993 115 115 Rs          \_ ./pcharmm32
>>>>>>>
>>>>>> Good. Then: do you see any remark about the h_rt being exceeded in the
>>>>>> messages file of the host, $SGE_ROOT/default/spool/compute-12-22/messages?
>>>>>>
>>>>>> I.e., is:
>>>>>>
>>>>>> $ qconf -sconf
>>>>>> ...
>>>>>> loglevel                     log_info
>>>>>>
>>>>>> set?
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>> On 10/30/2012 12:07 PM, Reuti wrote:
>>>>>>>> On 30.10.2012 at 20:02, Joseph Farran wrote:
>>>>>>>>
>>>>>>>>> Hi Reuti.
>>>>>>>>>
>>>>>>>>> Yes, I had that already set:
>>>>>>>>>
>>>>>>>>> qconf -sconf | fgrep execd_params
>>>>>>>>> execd_params                 ENABLE_ADDGRP_KILL=TRUE
>>>>>>>>>
>>>>>>>>> What is strange is that about 1 out of 10 jobs does get killed just
>>>>>>>>> fine when it goes past the hard wall clock limit.
>>>>>>>>>
>>>>>>>>> However, the majority of the jobs are not being killed when they go
>>>>>>>>> past their wall clock limit.
>>>>>>>>>
>>>>>>>>> How can I investigate this further?
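[Editor's note: ENABLE_ADDGRP_KILL works because sge_execd tags every process of a job with an extra supplementary group ID, and that tag can be inspected directly on a suspect process. A small Linux-only sketch (my own helper, not an SGE tool) that reads the `Groups:` line from /proc/&lt;pid&gt;/status; for a stray job you would look for the additional GID the shepherd assigned:]

```python
import os

def supplementary_groups(pid):
    """Return the supplementary group IDs of a process by parsing the
    Groups: line of /proc/<pid>/status (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("Groups:"):
                return sorted(int(g) for g in line.split()[1:])
    return []

# Here we just inspect ourselves; on a compute node you would pass the
# PID of the runaway ./pcharmm32 process instead.
print(supplementary_groups(os.getpid()))
```

[If the runaway process no longer carries the job's additional GID at all, ENABLE_ADDGRP_KILL cannot find it either, which would explain why only some jobs get cleaned up.]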
>>>>>>>> ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>>>>>
>>>>>>>> (f without -) and post the relevant lines of the application, please.
>>>>>>>>
>>>>>>>> -- Reuti
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 10/30/2012 11:44 AM, Reuti wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On 30.10.2012 at 19:31, Joseph Farran wrote:
>>>>>>>>>>
>>>>>>>>>>> I googled this issue but did not find much help on the subject.
>>>>>>>>>>>
>>>>>>>>>>> I have several queues with hard wall clock limits like this one:
>>>>>>>>>>>
>>>>>>>>>>> # qconf -sq queue | grep h_rt
>>>>>>>>>>> h_rt                  96:00:00
>>>>>>>>>>>
>>>>>>>>>>> I am running Son of Grid Engine 8.1.2, and many jobs run past the
>>>>>>>>>>> hard wall clock limit and continue to run.
>>>>>>>>>>>
>>>>>>>>>>> Looking at the GE qmaster logs, I see dozens and dozens of these entries:
>>>>>>>>>>>
>>>>>>>>>>> 10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
>>>>>>>>>>>
>>>>>>>>>> Maybe they jumped out of the process tree (usually jobs are killed
>>>>>>>>>> by `kill -9 -- -pgrp`). You can kill them by their additional group
>>>>>>>>>> id, which is attached to all started processes even if they executed
>>>>>>>>>> something like `setsid`:
>>>>>>>>>>
>>>>>>>>>> $ qconf -sconf
>>>>>>>>>> ...
>>>>>>>>>> execd_params                 ENABLE_ADDGRP_KILL=TRUE
>>>>>>>>>>
>>>>>>>>>> If it's still not working, we have to investigate the process tree.
>>>>>>>>>>
>>>>>>>>>> HTH - Reuti
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> These entries correspond to the running jobs that should have ended
>>>>>>>>>>> 96 hours ago, but they keep on running.
>>>>>>>>>>>
>>>>>>>>>>> Why is GE not killing these jobs correctly when they run past the
>>>>>>>>>>> 96-hour limit, yet complains that they should have ended?
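[Editor's note: Reuti's "jumped out of the process tree" theory is easy to reproduce: a child that calls setsid(2) leaves the job's process group, so the shepherd's usual `kill -9 -- -pgrp` no longer reaches it. A self-contained sketch using plain POSIX process handling (not SGE code); the "job script" here is a stand-in for a real batch job:]

```python
import os
import signal
import subprocess
import time

# Fake job script: one well-behaved child and one that detaches itself
# into a new session, exactly what a daemonizing tool does.
job = subprocess.Popen(
    ["sh", "-c", "sleep 300 & setsid sleep 300 & wait"],
    start_new_session=True,   # shepherd-style: the job gets its own pgrp
)
time.sleep(1)                 # let both children start

pgid = os.getpgid(job.pid)
os.killpg(pgid, signal.SIGKILL)   # the usual `kill -9 -- -pgrp`
job.wait()
time.sleep(1)

# The setsid child moved to a different process group and survived:
out = subprocess.run(["pgrep", "-fx", "sleep 300"],
                     capture_output=True, text=True)
survivors = [int(p) for p in out.stdout.split()]
print("escaped processes:", len(survivors))

for pid in survivors:          # clean up the escapee
    os.kill(pid, signal.SIGKILL)
```

[ENABLE_ADDGRP_KILL closes exactly this hole: the supplementary GID survives a setsid() call, while membership in the original process group does not.]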
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
