On 31.10.2012 at 00:30, Joseph Farran wrote:

> Looking at one of the other running jobs (that should have ended by now), I
> don't see the notify:
>
> # cat /var/spool/ge/qmaster/job_scripts/12923 | fgrep notify
>
> # qstat | grep 12923
> 12923 0.50500 dna.pmf_15 amentes      r     10/24/2012 18:59:08 [email protected]     1
It can be requested on the command line or in any of the sge_request files. Is it in:

$ qstat -j 12923

-- Reuti

> On 10/30/2012 04:18 PM, Reuti wrote:
>> On 31.10.2012 at 00:13, Joseph Farran wrote:
>>
>>> At first I only had the hard wall clock "h_rt", but a while ago I also
>>> added the soft one:
>>>
>>> Here are all of the related fields:
>>>
>>> # qconf -sq free2 | egrep "rt|notify|terminate"
>>> shell_start_mode      posix_compliant
>>> starter_method        NONE
>>> terminate_method      NONE
>>> notify                00:00:60
>>> s_rt                  96:00:00
>>> h_rt                  96:00:00
>>>
>>> Notify is set to 60, but I don't know what this does.
>> Were they also submitted with -notify? There was (is) an issue when both
>> the warning by s_rt and -notify are requested: the warnings are sent to
>> the job every 90 seconds, but it never gets killed.
>>
>> -- Reuti
>>
>>> On 10/30/2012 04:06 PM, Reuti wrote:
>>>> On 31.10.2012 at 00:03, Joseph Farran wrote:
>>>>
>>>>> The strace shows the job running ok: doing work and then writing to a file.
>>>>>
>>>>> I was able to kill the jobs (1 core each) just fine with "kill -9".
>>>>>
>>>>> A few minutes later the qmaster log said:
>>>>>
>>>>> 10/30/2012 15:58:41|worker|hpc|I|removing trigger to terminate job 12960.1
>>>>> 10/30/2012 15:58:41|worker|hpc|I|job 12960.1 finished on host compute-12-22.local
>>>>> 10/30/2012 15:58:41|worker|hpc|I|removing trigger to terminate job 12959.1
>>>>> 10/30/2012 15:58:41|worker|hpc|I|job 12959.1 finished on host compute-12-22.local
>>>> Did you define s_rt and -notify too?
>>>>
>>>> -- Reuti
>>>>
>>>>> So GE cleared out the jobs ok. Not sure why sge on the node is not
>>>>> killing them correctly.
>>>>>
>>>>> Oh well, thanks Reuti. I will keep playing with this...
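For reference, the `notify 00:00:60` value in the queue configuration above is the lead time for the warning signals: when a job is submitted with `qsub -notify`, sge_execd sends SIGUSR1 that long before a SIGSTOP and SIGUSR2 that long before the final SIGKILL. A minimal sketch of a job script that makes use of the warning (the `sleep` is a stand-in for the real work, e.g. ./pcharmm32; the checkpoint message is only illustrative):

```shell
#!/bin/sh
# Sketch of a job script for `qsub -notify`: sge_execd delivers SIGUSR2
# `notify` seconds (00:00:60 above) before the SIGKILL, giving the job
# a window to checkpoint and exit cleanly.
trap 'echo "SIGUSR2 received - shutting down cleanly"; exit 0' USR2
sleep 600 &            # stand-in for the real work
wait $!                # wait is interruptible, so the trap can fire
```

This is also why the s_rt/-notify interaction Reuti mentions matters: if the soft-limit warning path misbehaves, the job keeps receiving warnings without the terminal signal ever arriving.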
>>>>>
>>>>> On 10/30/2012 03:53 PM, Reuti wrote:
>>>>>> On 30.10.2012 at 23:45, Joseph Farran wrote:
>>>>>>
>>>>>>> No:
>>>>>>>
>>>>>>> # qconf -sq free2 | fgrep terminate
>>>>>>> terminate_method      NONE
>>>>>> Is the process still doing something serious, or is it hanging somewhere
>>>>>> in a loop:
>>>>>>
>>>>>> $ strace -p 1234
>>>>>>
>>>>>> where 1234 is the pid of the process on the node (you have to be root or
>>>>>> the owner of the process).
>>>>>>
>>>>>> Afterwards: is a kill -9 1234 by hand successful?
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>> On 10/30/2012 03:07 PM, Reuti wrote:
>>>>>>>> Mmh, was the terminate method redefined in the configuration of the
>>>>>>>> queue in question?
>>>>>>>>
>>>>>>>> On 30.10.2012 at 23:04, Joseph Farran wrote:
>>>>>>>>
>>>>>>>>> No, still no cigar.
>>>>>>>>>
>>>>>>>>> # cat /var/spool/ge/compute-12-22/messages | grep wall
>>>>>>>>> #
>>>>>>>>>
>>>>>>>>> Here is what is strange.
>>>>>>>>>
>>>>>>>>> Some jobs do get killed just fine.
>>>>>>>>> One job on another queue that just went over the time limit was
>>>>>>>>> killed by GE, and here is the log:
>>>>>>>>>
>>>>>>>>> 10/30/2012 14:32:06| main|compute-1-7|I|registered at qmaster host "hpc.local"
>>>>>>>>> 10/30/2012 14:32:06| main|compute-1-7|I|Reconnected to qmaster - enabled delayed job reporting period
>>>>>>>>> 10/30/2012 14:42:04| main|compute-1-7|I|Delayed job reporting period finished
>>>>>>>>> 10/30/2012 14:57:35| main|compute-1-7|W|job 12730.1 exceeded hard wallclock time - initiate terminate method
>>>>>>>>> 10/30/2012 14:57:36| main|compute-1-7|I|SIGNAL jid: 12730 jatask: 1 signal: KILL
>>>>>>>>>
>>>>>>>>> On 10/30/2012 03:00 PM, Reuti wrote:
>>>>>>>>>> Sorry, it should be like:
>>>>>>>>>>
>>>>>>>>>> 10/30/2012 22:59:50| main|pc15370|W|job 5281.1 exceeded hard wallclock time - initiate terminate method
>>>>>>>>>>
>>>>>>>>>> On 30.10.2012 at 22:57, Joseph Farran wrote:
>>>>>>>>>>
>>>>>>>>>>> I did not have loglevel set to log_info, so I updated it, restarted
>>>>>>>>>>> GE on the master, and did a softstop and start on the compute node.
>>>>>>>>>>>
>>>>>>>>>>> I get a lot more log information now, but still no cigar:
>>>>>>>>>>>
>>>>>>>>>>> # cat /var/spool/ge/compute-12-22/messages | fgrep h_rt
>>>>>>>>>>> #
>>>>>>>>>>>
>>>>>>>>>>> I checked a few other compute nodes for the "h_rt" as well and
>>>>>>>>>>> found nothing either.
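Rather than grepping each node's messages file by hand, the whole spool can be swept in one pass for the wallclock warning Reuti quoted. A sketch assuming the spool layout from the commands above (/var/spool/ge/<node>/messages; adjust the path to your own $SGE_ROOT/<cell>/spool if it differs):

```shell
# Find every execd messages file that logged the hard-wallclock warning;
# -H prefixes each match with the file (and thus the node) it came from.
grep -H "exceeded hard wallclock time" /var/spool/ge/*/messages
```

Nodes that never emit this line for an overdue job are the ones where the terminate method is not being initiated at all.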
>>>>>>>>>>>
>>>>>>>>>>> On 10/30/2012 01:49 PM, Reuti wrote:
>>>>>>>>>>>> On 30.10.2012 at 20:18, Joseph Farran wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Here is one case:
>>>>>>>>>>>>>
>>>>>>>>>>>>> qstat | egrep "12959|12960"
>>>>>>>>>>>>> 12959 0.50500 dna.pmf_17 amentes      r     10/24/2012 18:59:12 [email protected]     1
>>>>>>>>>>>>> 12960 0.50500 dna.pmf_17 amentes      r     10/24/2012 18:59:12 [email protected]     1
>>>>>>>>>>>>>
>>>>>>>>>>>>> On compute-12-22:
>>>>>>>>>>>>>
>>>>>>>>>>>>> compute-12-22 ~]# ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>>>>>>>>>>   0 570   0 201 Sl /data/hpc/ge/bin/lx-amd64/sge_execd
>>>>>>>>>>>>>   0   0   0   0 S   \_ /bin/bash /data/hpc/ge/load-sensor-cores-in-use.sh
>>>>>>>>>>>>>   0 570   0 201 S   \_ sge_shepherd-12959 -bg
>>>>>>>>>>>>> 993 993 115 115 Ss  |   \_ -bash /var/spool/ge/compute-12-22/job_scripts/12959
>>>>>>>>>>>>> 993 993 115 115 Rs  |       \_ ./pcharmm32
>>>>>>>>>>>>>   0 570   0 201 S   \_ sge_shepherd-12960 -bg
>>>>>>>>>>>>> 993 993 115 115 Ss      \_ -bash /var/spool/ge/compute-12-22/job_scripts/12960
>>>>>>>>>>>>> 993 993 115 115 Rs          \_ ./pcharmm32
>>>>>>>>>>>>>
>>>>>>>>>>>> Good. Then: do you see any remark about h_rt being exceeded in the
>>>>>>>>>>>> messages file of the host,
>>>>>>>>>>>> $SGE_ROOT/default/spool/compute-12-22/messages?
>>>>>>>>>>>>
>>>>>>>>>>>> I.e., is
>>>>>>>>>>>>
>>>>>>>>>>>> $ qconf -sconf
>>>>>>>>>>>> ...
>>>>>>>>>>>> loglevel              log_info
>>>>>>>>>>>>
>>>>>>>>>>>> set?
>>>>>>>>>>>>
>>>>>>>>>>>> -- Reuti
>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/30/2012 12:07 PM, Reuti wrote:
>>>>>>>>>>>>>> On 30.10.2012 at 20:02, Joseph Farran wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Reuti.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, I had that already set:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> qconf -sconf | fgrep execd_params
>>>>>>>>>>>>>>> execd_params          ENABLE_ADDGRP_KILL=TRUE
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What is strange is that about 1 out of 10 jobs does get killed
>>>>>>>>>>>>>>> just fine when it goes past the hard wall clock time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, the majority of the jobs are not being killed when
>>>>>>>>>>>>>>> they go past their wall clock time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> How can I investigate this further?
>>>>>>>>>>>>>> ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (f without -) and please post the relevant lines of the application.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- Reuti
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 10/30/2012 11:44 AM, Reuti wrote:
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 30.10.2012 at 19:31, Joseph Farran wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I googled this issue but did not find much help on the subject.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have several queues with hard wall clock limits like this one:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> # qconf -sq queue | grep h_rt
>>>>>>>>>>>>>>>>> h_rt                  96:00:00
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am running Son of Grid Engine 8.1.2, and many jobs run past
>>>>>>>>>>>>>>>>> the hard wall clock limit and continue to run.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Looking at the GE qmaster logs, I see dozens and dozens of
>>>>>>>>>>>>>>>>> these entries:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
>>>>>>>>>>>>>>>> Maybe they jumped out of the process tree (usually jobs are
>>>>>>>>>>>>>>>> killed by `kill -9 -- -pgrp`).
>>>>>>>>>>>>>>>> You can kill them by their additional group id, which is
>>>>>>>>>>>>>>>> attached to all started processes even if they executed
>>>>>>>>>>>>>>>> something like `setsid`:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> $ qconf -sconf
>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>> execd_params          ENABLE_ADDGRP_KILL=TRUE
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If it's still not working, we have to investigate the process
>>>>>>>>>>>>>>>> tree.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> HTH - Reuti
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> These entries correspond to the running jobs that should have
>>>>>>>>>>>>>>>>> ended 96 hours ago, but they keep on running.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Why is GE not killing these jobs when they run past the
>>>>>>>>>>>>>>>>> 96 hour limit, yet complains that they should have ended?

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
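To illustrate the additional-group-id mechanism Reuti describes: a `setsid` call gives a process its own session and process group, so a plain `kill -9 -- -pgrp` misses it, but supplementary group membership survives `setsid`. Scanning /proc for the job's extra GID therefore still finds escaped processes. A rough sketch of the idea, not the actual sge_execd code; the GID 20001 is made up, and the kill is replaced by an echo so the sketch is safe to run:

```shell
#!/bin/sh
# Find every process whose supplementary groups (the Groups: line in
# /proc/<pid>/status) contain the job's additional GID, regardless of
# its session or process group.
ADDGID=20001
for p in /proc/[0-9]*; do
    if grep -q "^Groups:.*\b${ADDGID}\b" "$p/status" 2>/dev/null; then
        echo "would kill ${p#/proc/}"   # execd sends SIGKILL at this point
    fi
done
```

With ENABLE_ADDGRP_KILL=TRUE the execd performs essentially this sweep when the terminate method fires, which is why it catches processes that a process-group kill cannot.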
