Mmh, was the terminate_method redefined in the configuration of the queue in question?
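As a quick way to check that: a queue's override (if any) shows up in its configuration. A sketch, with `queue` standing in for the queue name used in the thread:

```shell
# Sketch: see whether this queue redefines terminate_method
# ("queue" is a placeholder for the actual queue name).
qconf -sq queue | grep terminate_method
```

If it is NONE (the default), the shepherd delivers SIGKILL itself; a custom terminate_method script that fails to reach the whole process tree would also explain surviving jobs.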
On 30.10.2012, at 23:04, Joseph Farran wrote:
> No, still no cigar.
> 
> # cat /var/spool/ge/compute-12-22/messages | grep wall
> #
> 
> Here is what is strange.
> 
> Some jobs do get killed just fine. One job just went over the time limit on another queue; GE killed it, and here is the log:
> 
> 10/30/2012 14:32:06| main|compute-1-7|I|registered at qmaster host "hpc.local"
> 10/30/2012 14:32:06| main|compute-1-7|I|Reconnected to qmaster - enabled delayed job reporting period
> 10/30/2012 14:42:04| main|compute-1-7|I|Delayed job reporting period finished
> 10/30/2012 14:57:35| main|compute-1-7|W|job 12730.1 exceeded hard wallclock time - initiate terminate method
> 10/30/2012 14:57:36| main|compute-1-7|I|SIGNAL jid: 12730 jatask: 1 signal: KILL
> 
> On 10/30/2012 03:00 PM, Reuti wrote:
>> Sorry, should be like:
>> 
>> 10/30/2012 22:59:50| main|pc15370|W|job 5281.1 exceeded hard wallclock time - initiate terminate method
>> 
>> On 30.10.2012, at 22:57, Joseph Farran wrote:
>> 
>>> Did not have loglevel set to log_info, so I updated it, restarted GE on the master, and did a softstop and start on the compute node.
>>> 
>>> I got a lot more log information now, but still no cigar:
>>> 
>>> # cat /var/spool/ge/compute-12-22/messages | fgrep h_rt
>>> #
>>> 
>>> Checked a few other compute nodes as well for "h_rt", and nothing there either.
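For reference, the successful kill above leaves two distinct execd log lines ("exceeded hard wallclock time" and the SIGNAL line), so a per-job check on a suspect node can grep for either; job id 12959 below is one of the overrunning jobs from later in the thread, and the path is the one used there:

```shell
# Sketch: did execd ever try to terminate job 12959 on this node?
# (path and job id taken from the thread; adjust per host)
grep -E 'exceeded hard wallclock|SIGNAL' /var/spool/ge/compute-12-22/messages | grep -w 12959
```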
>>> 
>>> On 10/30/2012 01:49 PM, Reuti wrote:
>>>> On 30.10.2012, at 20:18, Joseph Farran wrote:
>>>> 
>>>>> Here is one case:
>>>>> 
>>>>> qstat | egrep "12959|12960"
>>>>> 12959 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12 [email protected] 1
>>>>> 12960 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12 [email protected] 1
>>>>> 
>>>>> On compute-12-22:
>>>>> 
>>>>> compute-12-22 ~]# ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>> 
>>>>>   0 570   0 201 Sl /data/hpc/ge/bin/lx-amd64/sge_execd
>>>>>   0   0   0   0 S   \_ /bin/bash /data/hpc/ge/load-sensor-cores-in-use.sh
>>>>>   0 570   0 201 S   \_ sge_shepherd-12959 -bg
>>>>> 993 993 115 115 Ss  |   \_ -bash /var/spool/ge/compute-12-22/job_scripts/12959
>>>>> 993 993 115 115 Rs  |       \_ ./pcharmm32
>>>>>   0 570   0 201 S   \_ sge_shepherd-12960 -bg
>>>>> 993 993 115 115 Ss      \_ -bash /var/spool/ge/compute-12-22/job_scripts/12960
>>>>> 993 993 115 115 Rs          \_ ./pcharmm32
>>>>> 
>>>> Good, then: do you see any remark about the h_rt being exceeded in the messages file of the host, $SGE_ROOT/default/spool/compute-12-22/messages?
>>>> 
>>>> I.e., is
>>>> 
>>>> $ qconf -sconf
>>>> ...
>>>> loglevel    log_info
>>>> 
>>>> set?
>>>> 
>>>> -- Reuti
>>>> 
>>>>> On 10/30/2012 12:07 PM, Reuti wrote:
>>>>>> On 30.10.2012, at 20:02, Joseph Farran wrote:
>>>>>> 
>>>>>>> Hi Reuti.
>>>>>>> 
>>>>>>> Yes, I had that already set:
>>>>>>> 
>>>>>>> qconf -sconf | fgrep execd_params
>>>>>>> execd_params    ENABLE_ADDGRP_KILL=TRUE
>>>>>>> 
>>>>>>> What is strange is that 1 out of 10 jobs or so do get killed just fine when they go past the hard wallclock limit.
>>>>>>> 
>>>>>>> However, the majority of the jobs are not being killed when they go past their wallclock limit.
>>>>>>> 
>>>>>>> How can I investigate this further?
>>>>>> ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>>> 
>>>>>> (f w/o -) and post the relevant lines of the application please.
>>>>>> 
>>>>>> -- Reuti
>>>>>> 
>>>>>>> On 10/30/2012 11:44 AM, Reuti wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> On 30.10.2012, at 19:31, Joseph Farran wrote:
>>>>>>>> 
>>>>>>>>> I googled this issue but did not see much help on the subject.
>>>>>>>>> 
>>>>>>>>> I have several queues with hard wall clock limits like this one:
>>>>>>>>> 
>>>>>>>>> # qconf -sq queue | grep h_rt
>>>>>>>>> h_rt    96:00:00
>>>>>>>>> 
>>>>>>>>> I am running Son of Grid Engine 8.1.2, and many jobs run past the hard wall clock limit and continue to run.
>>>>>>>>> 
>>>>>>>>> Looking at the GE qmaster logs, I see dozens and dozens of these entries:
>>>>>>>>> 
>>>>>>>>> 10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
>>>>>>>> Maybe they jumped out of the process tree (usually jobs are killed by `kill -9 -- -pgrp`). You can kill them by their additional group id, which is attached to all started processes even if they executed something like `setsid`:
>>>>>>>> 
>>>>>>>> $ qconf -sconf
>>>>>>>> ...
>>>>>>>> execd_params    ENABLE_ADDGRP_KILL=TRUE
>>>>>>>> 
>>>>>>>> If it's still not working, we have to investigate the process tree.
>>>>>>>> 
>>>>>>>> HTH - Reuti
>>>>>>>> 
>>>>>>>>> These entries correspond to the running jobs that should have ended 96 hours ago, but they keep on running.
>>>>>>>>> 
>>>>>>>>> Why is GE not killing these jobs correctly when they run past the 96-hour limit, yet complains that they should have ended?

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
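To make Reuti's "jumped out of the process tree" remark concrete: a process that calls setsid(2) moves into a fresh session and process group, so a plain `kill -9 -- -pgrp` aimed at the job's original group no longer reaches it. The additional group id set via ENABLE_ADDGRP_KILL survives exactly this trick, because supplementary group ids are inherited and an unprivileged process cannot drop them. A minimal, SGE-free sketch of the escape (`sleep 300` stands in for a job process):

```shell
#!/bin/sh
# Minimal illustration (not SGE-specific): a child started via setsid
# ends up in its own process group, outside the script's group, so a
# group-wide `kill -- -PGID` from the starter would miss it.
setsid sleep 300 &
sleep 1                                   # give setsid time to exec sleep

child=$(pgrep -n -x sleep)                # newest process named exactly "sleep"
my_pgid=$(ps -o pgid= -p $$ | tr -d ' ')
child_pgid=$(ps -o pgid= -p "$child" | tr -d ' ')

echo "script pgid=$my_pgid, escaped child pgid=$child_pgid"
kill "$child"                             # clean up the stray sleep
```

On a node with procps, `ps -e -o pid,supgid,comm` (the `supgid` column is procps-specific) lists each process's supplementary group ids, including the extra gid SGE tags job processes with — which is how execd can still find and kill such escapees.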
