On 31.10.2012 at 00:13, Joseph Farran wrote:

> At first, I only had the hard wall clock "h_rt", but a while ago I also added
> the soft one:
>
> Here are all of the related fields:
>
> # qconf -sq free2 | egrep "rt|notify|terminate"
> shell_start_mode      posix_compliant
> starter_method        NONE
> terminate_method      NONE
> notify                00:00:60
> s_rt                  96:00:00
> h_rt                  96:00:00
>
> Notify is set to 60, but I don't know what this does.
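For context: `notify` is the grace period between a warning signal and the real one. With `qsub -notify`, SIGUSR1 is delivered that many seconds before a suspend and SIGUSR2 before a kill; a job exceeding `s_rt` likewise receives SIGUSR1. A job script can trap the warning and shut down cleanly before the h_rt SIGKILL arrives. A minimal self-contained sketch (the handler, the variable names, and the self-delivered signal are illustrative only; no SGE is involved):

```shell
#!/bin/sh
# A job script that catches the warning signal SGE sends ahead of the
# real one (SIGUSR1 on s_rt or suspend, SIGUSR2 before a kill with
# qsub -notify). The warning is simulated locally with kill -USR1.

caught=0
on_warning() {
    caught=1                    # a real job would checkpoint/flush here
    kill "$worker" 2>/dev/null  # stop the work before SIGKILL arrives
}
trap on_warning USR1 USR2

sleep 30 &                      # stand-in for the actual computation
worker=$!

kill -USR1 $$                   # simulate the shepherd's warning
wait "$worker" 2>/dev/null
echo "warning caught: $caught"
```
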
Were they also submitted with -notify? There was (and still is) an issue when warnings from both s_rt and -notify are requested: the warning is sent to the job every 90 seconds, but the job never gets killed.

-- Reuti

> On 10/30/2012 04:06 PM, Reuti wrote:
>> On 31.10.2012 at 00:03, Joseph Farran wrote:
>>
>>> The strace shows the job running ok: doing work and then writing to a file.
>>>
>>> I was able to kill the jobs (1 core each) just fine with "kill -9".
>>>
>>> Looking at the qmaster log a few minutes later:
>>>
>>> 10/30/2012 15:58:41|worker|hpc|I|removing trigger to terminate job 12960.1
>>> 10/30/2012 15:58:41|worker|hpc|I|job 12960.1 finished on host compute-12-22.local
>>> 10/30/2012 15:58:41|worker|hpc|I|removing trigger to terminate job 12959.1
>>> 10/30/2012 15:58:41|worker|hpc|I|job 12959.1 finished on host compute-12-22.local
>> Did you define s_rt and -notify too?
>>
>> -- Reuti
>>
>>> So GE cleared out the jobs ok. Not sure why sge_execd on the node is not
>>> killing them correctly.
>>>
>>> Oh well, thanks Reuti. I will keep playing with this...
>>>
>>> On 10/30/2012 03:53 PM, Reuti wrote:
>>>> On 30.10.2012 at 23:45, Joseph Farran wrote:
>>>>
>>>>> No:
>>>>>
>>>>> # qconf -sq free2 | fgrep terminate
>>>>> terminate_method      NONE
>>>> Is the process still doing something serious, or hanging somewhere in a
>>>> loop?
>>>>
>>>> $ strace -p 1234
>>>>
>>>> where 1234 is the pid of the process on the node (you have to be root or
>>>> the owner of the process).
>>>>
>>>> Afterwards: is a kill -9 1234 by hand successful?
>>>>
>>>> -- Reuti
>>>>
>>>>> On 10/30/2012 03:07 PM, Reuti wrote:
>>>>>> Mmh, was the terminate method redefined in the queue configuration of
>>>>>> the queue in question?
>>>>>>
>>>>>> On 30.10.2012 at 23:04, Joseph Farran wrote:
>>>>>>
>>>>>>> No, still no cigar.
>>>>>>>
>>>>>>> # cat /var/spool/ge/compute-12-22/messages | grep wall
>>>>>>> #
>>>>>>>
>>>>>>> Here is what is strange.
>>>>>>>
>>>>>>> Some jobs do get killed just fine. One job just went over the time
>>>>>>> limit on another queue; GE killed it, and here is the log:
>>>>>>>
>>>>>>> 10/30/2012 14:32:06| main|compute-1-7|I|registered at qmaster host "hpc.local"
>>>>>>> 10/30/2012 14:32:06| main|compute-1-7|I|Reconnected to qmaster - enabled delayed job reporting period
>>>>>>> 10/30/2012 14:42:04| main|compute-1-7|I|Delayed job reporting period finished
>>>>>>> 10/30/2012 14:57:35| main|compute-1-7|W|job 12730.1 exceeded hard wallclock time - initiate terminate method
>>>>>>> 10/30/2012 14:57:36| main|compute-1-7|I|SIGNAL jid: 12730 jatask: 1 signal: KILL
>>>>>>>
>>>>>>> On 10/30/2012 03:00 PM, Reuti wrote:
>>>>>>>> Sorry, it should look like:
>>>>>>>>
>>>>>>>> 10/30/2012 22:59:50| main|pc15370|W|job 5281.1 exceeded hard wallclock time - initiate terminate method
>>>>>>>>
>>>>>>>> On 30.10.2012 at 22:57, Joseph Farran wrote:
>>>>>>>>
>>>>>>>>> I did not have loglevel set to log_info, so I updated it, restarted
>>>>>>>>> GE on the master, and did a softstop and start on the compute node.
>>>>>>>>>
>>>>>>>>> I got a lot more log information now, but still no cigar:
>>>>>>>>>
>>>>>>>>> # cat /var/spool/ge/compute-12-22/messages | fgrep h_rt
>>>>>>>>> #
>>>>>>>>>
>>>>>>>>> I checked a few other compute nodes as well for "h_rt" and found
>>>>>>>>> nothing either.
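The log lines being grepped for look like the "exceeded hard wallclock time" entries quoted above, and each node has its own messages file. One way to sweep every execd spool directory in one pass is sketched below; the spool path follows the layout used in this thread, and a sample messages file is fabricated first only so the snippet runs stand-alone:

```shell
#!/bin/sh
# Sweep every execd messages file for wallclock kill attempts.
# On a real cluster, spool would be e.g. /var/spool/ge; here a fake
# node directory and log entry are created so the sketch is runnable.
spool=${SPOOL:-./spool}

mkdir -p "$spool/compute-1-7"
cat > "$spool/compute-1-7/messages" <<'EOF'
10/30/2012 14:57:35| main|compute-1-7|W|job 12730.1 exceeded hard wallclock time - initiate terminate method
EOF

# Count kill attempts across all nodes' messages files:
hits=$(grep -h "exceeded hard wallclock" "$spool"/*/messages | wc -l)
echo "wallclock kill attempts found: $hits"
```

A node that never logs such a line (with loglevel at log_info) never even started its terminate method, which narrows the problem to the execd side rather than the kill itself.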
>>>>>>>>>
>>>>>>>>> On 10/30/2012 01:49 PM, Reuti wrote:
>>>>>>>>>> On 30.10.2012 at 20:18, Joseph Farran wrote:
>>>>>>>>>>
>>>>>>>>>>> Here is one case:
>>>>>>>>>>>
>>>>>>>>>>> qstat | egrep "12959|12960"
>>>>>>>>>>> 12959 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12 [email protected] 1
>>>>>>>>>>> 12960 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12 [email protected] 1
>>>>>>>>>>>
>>>>>>>>>>> On compute-12-22:
>>>>>>>>>>>
>>>>>>>>>>> compute-12-22 ~]# ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>>>>>>>>
>>>>>>>>>>>   0 570   0 201 Sl /data/hpc/ge/bin/lx-amd64/sge_execd
>>>>>>>>>>>   0   0   0   0 S   \_ /bin/bash /data/hpc/ge/load-sensor-cores-in-use.sh
>>>>>>>>>>>   0 570   0 201 S   \_ sge_shepherd-12959 -bg
>>>>>>>>>>> 993 993 115 115 Ss  |   \_ -bash /var/spool/ge/compute-12-22/job_scripts/12959
>>>>>>>>>>> 993 993 115 115 Rs  |       \_ ./pcharmm32
>>>>>>>>>>>   0 570   0 201 S   \_ sge_shepherd-12960 -bg
>>>>>>>>>>> 993 993 115 115 Ss      \_ -bash /var/spool/ge/compute-12-22/job_scripts/12960
>>>>>>>>>>> 993 993 115 115 Rs          \_ ./pcharmm32
>>>>>>>>>>>
>>>>>>>>>> Good. Then: do you see any remark about h_rt being exceeded in the
>>>>>>>>>> messages file of the host,
>>>>>>>>>> $SGE_ROOT/default/spool/compute-12-22/messages?
>>>>>>>>>>
>>>>>>>>>> I.e., is
>>>>>>>>>>
>>>>>>>>>> $ qconf -sconf
>>>>>>>>>> ...
>>>>>>>>>> loglevel              log_info
>>>>>>>>>>
>>>>>>>>>> set?
>>>>>>>>>>
>>>>>>>>>> -- Reuti
>>>>>>>>>>
>>>>>>>>>>> On 10/30/2012 12:07 PM, Reuti wrote:
>>>>>>>>>>>> On 30.10.2012 at 20:02, Joseph Farran wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Reuti.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, I already had that set:
>>>>>>>>>>>>>
>>>>>>>>>>>>> qconf -sconf | fgrep execd_params
>>>>>>>>>>>>> execd_params      ENABLE_ADDGRP_KILL=TRUE
>>>>>>>>>>>>>
>>>>>>>>>>>>> What is strange is that about 1 out of 10 jobs does get killed
>>>>>>>>>>>>> just fine when it goes past the hard wall clock time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, the majority of the jobs are not being killed when they
>>>>>>>>>>>>> go past their wall clock time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> How can I investigate this further?
>>>>>>>>>>>> ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>>>>>>>>>
>>>>>>>>>>>> (f without -) and please post the relevant lines of the
>>>>>>>>>>>> application.
>>>>>>>>>>>>
>>>>>>>>>>>> -- Reuti
>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/30/2012 11:44 AM, Reuti wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 30.10.2012 at 19:31, Joseph Farran wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I googled this issue but did not find much help on the subject.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have several queues with hard wall clock limits like this one:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # qconf -sq queue | grep h_rt
>>>>>>>>>>>>>>> h_rt                  96:00:00
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am running Son of Grid Engine 8.1.2, and many jobs run past
>>>>>>>>>>>>>>> the hard wall clock limit and continue to run.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking at the GE qmaster logs, I see dozens and dozens of
>>>>>>>>>>>>>>> these entries:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
>>>>>>>>>>>>>> Maybe they jumped out of the process tree (usually jobs are
>>>>>>>>>>>>>> killed by `kill -9 -- -pgrp`). You can kill them by their
>>>>>>>>>>>>>> additional group id, which is attached to all started processes
>>>>>>>>>>>>>> even if they executed something like `setsid`:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> $ qconf -sconf
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>> execd_params      ENABLE_ADDGRP_KILL=TRUE
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If it's still not working, we have to investigate the process
>>>>>>>>>>>>>> tree.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> HTH -- Reuti
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These entries correspond to the running jobs that should have
>>>>>>>>>>>>>>> ended 96 hours ago, but they keep on running.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Why is GE not killing these jobs when they run past the
>>>>>>>>>>>>>>> 96-hour limit, yet complaining that they should have ended?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> users mailing list
>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>> https://gridengine.org/mailman/listinfo/users

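Related to the s_rt/-notify issue mentioned at the top of the thread: a warning signal on its own can never stop a job that traps or ignores it, so if the follow-up SIGKILL is never issued, the job simply runs on while warnings repeat. A self-contained demonstration (no SGE involved; names are illustrative):

```shell
#!/bin/sh
# SIGUSR1 terminates a process by default, but a "job" that ignores it
# survives any number of warnings; only a follow-up SIGKILL ends it.

sh -c 'trap "" USR1; sleep 300' &   # a job that ignores the warning
job=$!
sleep 1                             # give it time to install the trap

kill -USR1 "$job"                   # first warning
kill -USR1 "$job"                   # second warning

survived=0
if kill -0 "$job" 2>/dev/null; then
    survived=1
fi
echo "job survived warnings: $survived"

kill -9 "$job"                      # the kill that must eventually follow
wait "$job" 2>/dev/null
```
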