On 30.10.2012 at 20:18, Joseph Farran wrote:

> Here is one case:
> 
> qstat| egrep "12959|12960"
>  12959 0.50500 dna.pmf_17 amentes      r     10/24/2012 18:59:12 [email protected]          1
>  12960 0.50500 dna.pmf_17 amentes      r     10/24/2012 18:59:12 [email protected]          1
> 
> On compute-12-22:
> 
> compute-12-22 ~]# ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
> 
>    0   570     0   201 Sl   /data/hpc/ge/bin/lx-amd64/sge_execd
>    0     0     0     0 S     \_ /bin/bash /data/hpc/ge/load-sensor-cores-in-use.sh
>    0   570     0   201 S     \_ sge_shepherd-12959 -bg
>  993   993   115   115 Ss    |   \_ -bash /var/spool/ge/compute-12-22/job_scripts/12959
>  993   993   115   115 Rs    |       \_ ./pcharmm32
>    0   570     0   201 S     \_ sge_shepherd-12960 -bg
>  993   993   115   115 Ss        \_ -bash /var/spool/ge/compute-12-22/job_scripts/12960
>  993   993   115   115 Rs            \_ ./pcharmm32
> 

Good. Then: do you see any remark about h_rt being exceeded in the messages 
file of the host, $SGE_ROOT/default/spool/compute-12-22/messages?

That is, provided that:

$ qconf -sconf
...
loglevel                     log_info

is set?
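
A minimal sketch of that check, assuming the execd messages file is at the path given above and that relevant log lines mention the job ID as plain text. The exact wording of an h_rt message is not shown in this thread, so the filter matches on the job ID only. The function name `job_messages` is made up for illustration, and `-w` assumes a grep that supports whole-word matching (GNU grep does):

```shell
#!/bin/sh
# Hypothetical helper: print the lines of an execd messages file
# that mention a given job ID.
# Usage: job_messages <messages-file> <job-id>
job_messages() {
    file=$1
    job=$2
    # -w matches the job ID as a whole word, so 12959 does not match 112959.
    grep -w -- "$job" "$file"
}

# Example call (path from the question above; adjust to the local spool setup):
# job_messages "$SGE_ROOT/default/spool/compute-12-22/messages" 12959
```

If nothing shows up for a job that is well past its limit, it may be worth confirming the loglevel setting before concluding that the execd never tried to kill it.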

-- Reuti


> On 10/30/2012 12:07 PM, Reuti wrote:
>> On 30.10.2012 at 20:02, Joseph Farran wrote:
>> 
>>> Hi Reuti.
>>> 
>>> Yes, I had that already set:
>>> 
>>> qconf -sconf|fgrep execd_params
>>> execd_params                 ENABLE_ADDGRP_KILL=TRUE
>>> 
>>> What is strange is that about 1 out of 10 jobs does get killed just fine 
>>> when it goes past the hard wall clock limit.
>>> 
>>> However, the majority of the jobs are not killed when they go past 
>>> their wall clock limit.
>>> 
>>> How can I investigate this further?
>> ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>> 
>> (f without a leading -) and post the relevant lines of the application please.
>> 
>> -- Reuti
>> 
>> 
>>> 
>>> On 10/30/2012 11:44 AM, Reuti wrote:
>>>> Hi,
>>>> 
>>>> On 30.10.2012 at 19:31, Joseph Farran wrote:
>>>> 
>>>>> I googled this issue but did not find much help on the subject.
>>>>> 
>>>>> I have several queues with hard wall clock limits like this one:
>>>>> 
>>>>> # qconf -sq queue  | grep h_rt
>>>>> h_rt                  96:00:00
>>>>> 
>>>>> I am running Son of Grid Engine 8.1.2, and many jobs run past the hard 
>>>>> wall clock limit and continue to run.
>>>>> 
>>>>> Looking at the GE qmaster logs, I see dozens and dozens of these entries:
>>>>> 
>>>>>    10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
>>>> Maybe they jumped out of the process tree (usually jobs are killed by 
>>>> `kill -9 -- -pgrp`). You can kill them by their additional group ID, which 
>>>> is attached to all started processes even if they executed something like 
>>>> `setsid`:
>>>> 
>>>> $ qconf -sconf
>>>> ...
>>>> execd_params                 ENABLE_ADDGRP_KILL=TRUE
>>>> 
>>>> If it's still not working, we have to investigate the process tree.
>>>> 
>>>> HTH - Reuti
>>>> 
>>>> 
>>>>> These entries correspond to the running jobs that should have ended 96 
>>>>> hours ago, but they keep on running.
>>>>> 
>>>>> Why is GE not killing these jobs when they run past the 96 hour 
>>>>> limit, yet complaining that they should have ended?
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> [email protected]
>>>>> https://gridengine.org/mailman/listinfo/users
>> 
> 

