Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

Reuti Tue, 30 Oct 2012 16:07:14 -0700

On 31.10.2012 at 00:03, Joseph Farran wrote:

> The strace shows job running ok:  doing work and then writing to a file.
> 
> I was able to kill the jobs ( 1-core each ) just fine with "kill -9".
> 
> After a few minutes, the qmaster log showed:
> 
> 10/30/2012 15:58:41|worker|hpc|I|removing trigger to terminate job 12960.1
> 10/30/2012 15:58:41|worker|hpc|I|job 12960.1 finished on host compute-12-22.local
> 10/30/2012 15:58:41|worker|hpc|I|removing trigger to terminate job 12959.1
> 10/30/2012 15:58:41|worker|hpc|I|job 12959.1 finished on host compute-12-22.local
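
A minimal sketch, in case it is useful for tracking which jobs qmaster still considers overdue: the function below filters qmaster messages for the "should have finished since" warnings quoted further down in this thread and prints each job ID with its most recent overdue time in seconds. The function name `overdue_jobs` is made up for illustration, and the line format is assumed to match the entries posted here.

```shell
#!/bin/sh
# List jobs that qmaster reports as overdue, with the overdue time in seconds.
# Reads qmaster "messages" lines on stdin. Assumption: fields are separated by
# "|" and the message is the fifth field, as in the lines quoted in this thread.
overdue_jobs() {
    awk -F'|' '
        $5 ~ /^job [0-9]+\.[0-9]+ should have finished since [0-9]+s/ {
            split($5, w, " ")        # w[2] = job.task, w[7] = e.g. "42318s"
            secs = w[7]
            sub(/s$/, "", secs)      # strip the trailing unit
            over[w[2]] = secs        # later lines overwrite earlier ones
        }
        END { for (j in over) printf "%s %d\n", j, over[j] }
    ' | sort
}
```

It would be fed the qmaster messages file on stdin, e.g. `overdue_jobs < /path/to/qmaster/messages`, where the path is a placeholder for wherever the qmaster spool lives on the installation.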


Did you define s_rt and -notify too?
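
For context on that question: when s_rt is exceeded the job is warned with SIGUSR1, and a job submitted with -notify gets SIGUSR2 shortly before a pending SIGKILL. A job script can trap both. The following is only a sketch under those assumptions; the handler names and the `got_signal` variable are invented for illustration.

```shell
#!/bin/sh
# Sketch of a job script that reacts to the warning signals.
# Assumptions: s_rt is set for the job and the job was submitted with -notify.
got_signal=""

on_soft_limit() {
    # SIGUSR1: warning sent when the soft run-time limit (s_rt) is exceeded.
    got_signal="USR1"
    echo "soft run-time limit reached, saving state" >&2
}

on_kill_warning() {
    # SIGUSR2: with -notify, sent shortly before a pending SIGKILL.
    got_signal="USR2"
    echo "kill pending, cleaning up" >&2
}

trap on_soft_limit USR1
trap on_kill_warning USR2

# The real workload would start here. Running it in the background and
# calling `wait` lets the shell run the traps without waiting for the
# workload to exit first.
```

If the script traps these signals and then simply carries on, the process will still be there when the hard limit is reached, so it is worth checking what the job script does with them.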

-- Reuti



> So GE cleared out the jobs OK.   Not sure why SGE on the node is not killing 
> them correctly.
> 
> Oh well, thanks Reuti.   I will keep playing with this...
> 
> 
> 
> On 10/30/2012 03:53 PM, Reuti wrote:
>> On 30.10.2012 at 23:45, Joseph Farran wrote:
>> 
>>> No:
>>> 
>>> # qconf -sq free2 | fgrep terminate
>>> terminate_method      NONE
>> Is the process still doing something serious, or is it hanging somewhere in a loop? Check with:
>> 
>> $ strace -p 1234
>> 
>> where 1234 is the PID of the process on the node (you have to be root or the 
>> owner of the process).
>> 
>> Afterwards: is a kill -9 1234 by hand successful?
>> 
>> -- Reuti
>> 
>> 
>>> On 10/30/2012 03:07 PM, Reuti wrote:
>>>> Mmh, was the terminate method redefined in the queue configuration of the 
>>>> queue in question?
>>>> 
>>>> 
>>>> On 30.10.2012 at 23:04, Joseph Farran wrote:
>>>> 
>>>>> No, still no cigar.
>>>>> 
>>>>> # cat  /var/spool/ge/compute-12-22/messages | grep wall
>>>>> #
>>>>> 
>>>>> Here is what is strange.
>>>>> 
>>>>> Some jobs do get killed just fine.   One job on another queue just went 
>>>>> over the time limit; GE killed it, and here is the log:
>>>>> 
>>>>> 10/30/2012 14:32:06|  main|compute-1-7|I|registered at qmaster host "hpc.local"
>>>>> 10/30/2012 14:32:06|  main|compute-1-7|I|Reconnected to qmaster - enabled delayed job reporting period
>>>>> 10/30/2012 14:42:04|  main|compute-1-7|I|Delayed job reporting period finished
>>>>> 10/30/2012 14:57:35|  main|compute-1-7|W|job 12730.1 exceeded hard wallclock time - initiate terminate method
>>>>> 10/30/2012 14:57:36|  main|compute-1-7|I|SIGNAL jid: 12730 jatask: 1 signal: KILL
>>>>> 
>>>>> 
>>>>> On 10/30/2012 03:00 PM, Reuti wrote:
>>>>>> Sorry, should be like:
>>>>>> 
>>>>>> 10/30/2012 22:59:50|  main|pc15370|W|job 5281.1 exceeded hard wallclock time - initiate terminate method
>>>>>> 
>>>>>> 
>>>>>> On 30.10.2012 at 22:57, Joseph Farran wrote:
>>>>>> 
>>>>>>> I did not have loglevel set to log_info, so I updated it, restarted GE 
>>>>>>> on the master, and did a softstop and start on the compute node.
>>>>>>> 
>>>>>>> I got a lot more log information now, but still no cigar:
>>>>>>> 
>>>>>>> # cat /var/spool/ge/compute-12-22/messages | fgrep h_rt
>>>>>>> #
>>>>>>> 
>>>>>>> Checked a few other compute nodes as well for the "h_rt" and nothing 
>>>>>>> either.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 10/30/2012 01:49 PM, Reuti wrote:
>>>>>>>> On 30.10.2012 at 20:18, Joseph Farran wrote:
>>>>>>>> 
>>>>>>>>> Here is one case:
>>>>>>>>> 
>>>>>>>>> qstat| egrep "12959|12960"
>>>>>>>>>  12959 0.50500 dna.pmf_17 amentes      r     10/24/2012 18:59:12 [email protected]          1
>>>>>>>>>  12960 0.50500 dna.pmf_17 amentes      r     10/24/2012 18:59:12 [email protected]          1
>>>>>>>>> 
>>>>>>>>> On compute-12-22:
>>>>>>>>> 
>>>>>>>>> compute-12-22 ~]# ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>>>>>> 
>>>>>>>>>    0   570     0   201 Sl   /data/hpc/ge/bin/lx-amd64/sge_execd
>>>>>>>>>    0     0     0     0 S     \_ /bin/bash /data/hpc/ge/load-sensor-cores-in-use.sh
>>>>>>>>>    0   570     0   201 S     \_ sge_shepherd-12959 -bg
>>>>>>>>>  993   993   115   115 Ss    |   \_ -bash /var/spool/ge/compute-12-22/job_scripts/12959
>>>>>>>>>  993   993   115   115 Rs    |       \_ ./pcharmm32
>>>>>>>>>    0   570     0   201 S     \_ sge_shepherd-12960 -bg
>>>>>>>>>  993   993   115   115 Ss        \_ -bash /var/spool/ge/compute-12-22/job_scripts/12960
>>>>>>>>>  993   993   115   115 Rs            \_ ./pcharmm32
>>>>>>>>> 
>>>>>>>> Good. Then: do you see any remark about h_rt being exceeded in the 
>>>>>>>> messages file of the host, 
>>>>>>>> $SGE_ROOT/default/spool/compute-12-22/messages?
>>>>>>>> 
>>>>>>>> Also, is the log level set to log_info, i.e. does the configuration show:
>>>>>>>> 
>>>>>>>> $ qconf -sconf
>>>>>>>> ...
>>>>>>>> loglevel                     log_info
>>>>>>>> 
>>>>>>>> -- Reuti
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 10/30/2012 12:07 PM, Reuti wrote:
>>>>>>>>>> On 30.10.2012 at 20:02, Joseph Farran wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Reuti.
>>>>>>>>>>> 
>>>>>>>>>>> Yes, I had that already set:
>>>>>>>>>>> 
>>>>>>>>>>> qconf -sconf|fgrep execd_params
>>>>>>>>>>> execd_params                 ENABLE_ADDGRP_KILL=TRUE
>>>>>>>>>>> 
>>>>>>>>>>> What is strange is that 1 out of 10 jobs or so do get killed just 
>>>>>>>>>>> fine when they go past the hard wall time clock.
>>>>>>>>>>> 
>>>>>>>>>>> However, the majority of the jobs are not being killed when they go 
>>>>>>>>>>> past their wall time clock.
>>>>>>>>>>> 
>>>>>>>>>>> How can I investigate this further?
>>>>>>>>>> ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>>>>>>> 
>>>>>>>>>> (f without a leading -) and please post the relevant lines for the application.
>>>>>>>>>> 
>>>>>>>>>> -- Reuti
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On 10/30/2012 11:44 AM, Reuti wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> 
>>>>>>>>>>>> On 30.10.2012 at 19:31, Joseph Farran wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> I googled this issue but did not find much help on the subject.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I have several queues with hard wall clock limits like this one:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> # qconf -sq queue  | grep h_rt
>>>>>>>>>>>>> h_rt                  96:00:00
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I am running Son of Grid Engine 8.1.2, and many jobs run past the 
>>>>>>>>>>>>> hard wall clock limit and continue to run.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Looking at the GE qmaster logs, I see dozens and dozens of these entries:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>    10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
>>>>>>>>>>>> Maybe they jumped out of the process tree (usually jobs are killed 
>>>>>>>>>>>> by `kill -9 -- -pgrp`). You can kill them by their additional group 
>>>>>>>>>>>> ID instead, which is attached to all started processes even if they 
>>>>>>>>>>>> executed something like `setsid`:
>>>>>>>>>>>> 
>>>>>>>>>>>> $ qconf -sconf
>>>>>>>>>>>> ...
>>>>>>>>>>>> execd_params                 ENABLE_ADDGRP_KILL=TRUE
>>>>>>>>>>>> 
>>>>>>>>>>>> If it's still not working, we have to investigate the process tree.
>>>>>>>>>>>> 
>>>>>>>>>>>> HTH - Reuti
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> These entries correspond to the running jobs that should have 
>>>>>>>>>>>>> ended 96 hours ago, but they keep on running.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Why is GE not killing these jobs when they run past the 96 hour 
>>>>>>>>>>>>> limit, yet complains that they should have ended?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> users mailing list
>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>> https://gridengine.org/mailman/listinfo/users
>> 
> 

