Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

Reuti Tue, 30 Oct 2012 15:02:01 -0700

Sorry, the entry should look like this:

10/30/2012 22:59:50|  main|pc15370|W|job 5281.1 exceeded hard wallclock time - initiate terminate method
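
A minimal sketch of that check, assuming the execd messages file is at the spool path used elsewhere in this thread and that each entry is on a single line in the file (the wrap is from email only). The function name find_wallclock_kills is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print execd log entries reporting that a job
# exceeded its hard wallclock limit. The search string is copied from
# the sample log entry in this message.
find_wallclock_kills() {
    # $1: path to an execd messages file
    grep -F 'exceeded hard wallclock time' "$1"
}

# Example call (path as seen on compute-12-22 in this thread):
# find_wallclock_kills /var/spool/ge/compute-12-22/messages
```

If this prints nothing for a job that is past its limit, the execd on that host never logged an attempt to terminate it.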



On 30.10.2012 at 22:57, Joseph Farran wrote:

> I did not have loglevel set to log_info, so I updated it, restarted GE on the 
> master, and did a softstop and start on the compute node.
> 
> I got a lot more log information now, but still no cigar:
> 
> # cat /var/spool/ge/compute-12-22/messages | fgrep h_rt
> #
> 
> I checked a few other compute nodes for "h_rt" as well and found nothing there either.
> 
> 
> 
> On 10/30/2012 01:49 PM, Reuti wrote:
>> On 30.10.2012 at 20:18, Joseph Farran wrote:
>> 
>>> Here is one case:
>>> 
>>> qstat| egrep "12959|12960"
>>>  12959 0.50500 dna.pmf_17 amentes      r     10/24/2012 18:59:12 [email protected]          1
>>>  12960 0.50500 dna.pmf_17 amentes      r     10/24/2012 18:59:12 [email protected]          1
>>> 
>>> On compute-12-22:
>>> 
>>> compute-12-22 ~]# ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>> 
>>>    0   570     0   201 Sl   /data/hpc/ge/bin/lx-amd64/sge_execd
>>>    0     0     0     0 S     \_ /bin/bash /data/hpc/ge/load-sensor-cores-in-use.sh
>>>    0   570     0   201 S     \_ sge_shepherd-12959 -bg
>>>  993   993   115   115 Ss    |   \_ -bash /var/spool/ge/compute-12-22/job_scripts/12959
>>>  993   993   115   115 Rs    |       \_ ./pcharmm32
>>>    0   570     0   201 S     \_ sge_shepherd-12960 -bg
>>>  993   993   115   115 Ss        \_ -bash /var/spool/ge/compute-12-22/job_scripts/12960
>>>  993   993   115   115 Rs            \_ ./pcharmm32
>>> 
>> Good. Then: do you see any remark about h_rt being exceeded in the host's 
>> messages file, $SGE_ROOT/default/spool/compute-12-22/messages?
>> 
>> I.e.:
>> 
>> $ qconf -sconf
>> ...
>> loglevel                     log_info
>> 
>> is set?
>> 
>> -- Reuti
>> 
>> 
>>> On 10/30/2012 12:07 PM, Reuti wrote:
>>>> On 30.10.2012 at 20:02, Joseph Farran wrote:
>>>> 
>>>>> Hi Reuti.
>>>>> 
>>>>> Yes, I had that already set:
>>>>> 
>>>>> qconf -sconf|fgrep execd_params
>>>>> execd_params                 ENABLE_ADDGRP_KILL=TRUE
>>>>> 
>>>>> What is strange is that about 1 in 10 jobs does get killed just fine 
>>>>> when it goes past the hard wall clock limit.
>>>>> 
>>>>> However, the majority of the jobs are not killed when they go past 
>>>>> their wall clock limit.
>>>>> 
>>>>> How can I investigate this further?
>>>> ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>> 
>>>> (f without a leading -), and please post the relevant lines for the application.
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> On 10/30/2012 11:44 AM, Reuti wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> On 30.10.2012 at 19:31, Joseph Farran wrote:
>>>>>> 
>>>>>>> I googled this issue but did not find much help on the subject.
>>>>>>> 
>>>>>>> I have several queues with hard wall clock limits like this one:
>>>>>>> 
>>>>>>> # qconf -sq queue  | grep h_rt
>>>>>>> h_rt                  96:00:00
>>>>>>> 
>>>>>>> I am running Son of Grid Engine 8.1.2, and many jobs run past the hard 
>>>>>>> wall clock limit and continue to run.
>>>>>>> 
>>>>>>> Looking at the GE qmaster logs, I see dozens and dozens of these entries:
>>>>>>> 
>>>>>>>    10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
>>>>>> Maybe they jumped out of the process tree (usually jobs are killed by 
>>>>>> `kill -9 -- -pgrp`). You can kill them by their additional group id, 
>>>>>> which is attached to all started processes even if they executed 
>>>>>> something like `setsid`:
>>>>>> 
>>>>>> $ qconf -sconf
>>>>>> ...
>>>>>> execd_params                 ENABLE_ADDGRP_KILL=TRUE
>>>>>> 
>>>>>> If it's still not working, we have to investigate the process tree.
>>>>>> 
>>>>>> HTH - Reuti
>>>>>> 
>>>>>> 
>>>>>>> These entries correspond to running jobs that should have ended at the 
>>>>>>> 96 hour limit, but they keep on running.
>>>>>>> 
>>>>>>> Why is GE not killing these jobs when they run past the 96 hour limit, 
>>>>>>> even though it complains that they should have ended?
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> [email protected]
>>>>>>> https://gridengine.org/mailman/listinfo/users
>> 
> 
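
The qmaster warnings quoted at the start of the thread ("job 13179.1 should have finished since 42318s") can be turned into a list of overdue jobs. This is a rough sketch, assuming each warning is on a single line in the qmaster messages file and has exactly the format shown above. The function name overdue_jobs is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: list job IDs from "should have finished since Ns"
# warnings, with the overdue time converted to hours and minutes.
overdue_jobs() {
    # $1: path to a qmaster messages file
    sed -n 's/.*|job \([0-9][0-9.]*\) should have finished since \([0-9][0-9]*\)s.*/\1 \2/p' "$1" |
        awk '{ printf "%s %dh%02dm\n", $1, int($2 / 3600), int(($2 % 3600) / 60) }'
}
```

For the sample entry above this prints "13179.1 11h45m". The listed job IDs can then be compared against the process tree on the execution host.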


