Mmh, was the terminate_method redefined in the configuration of the queue in question?
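As a quick way to check that: a queue's override (if any) shows up in its configuration. A sketch, with `queue` standing in for the queue name used in the thread:

```shell
# Sketch: see whether this queue redefines terminate_method
# ("queue" is a placeholder for the actual queue name).
qconf -sq queue | grep terminate_method
```

If it is NONE (the default), the shepherd delivers SIGKILL itself; a custom terminate_method script that fails to reach the whole process tree would also explain surviving jobs.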
On 30.10.2012, at 23:04, Joseph Farran wrote:
> No, still no cigar.
> 
> # cat /var/spool/ge/compute-12-22/messages | grep wall
> #
> 
> Here is what is strange.
> 
> Some jobs do get killed just fine. One job just went over the time limit on another queue; GE killed it, and here is the log:
> 
> 10/30/2012 14:32:06| main|compute-1-7|I|registered at qmaster host "hpc.local"
> 10/30/2012 14:32:06| main|compute-1-7|I|Reconnected to qmaster - enabled delayed job reporting period
> 10/30/2012 14:42:04| main|compute-1-7|I|Delayed job reporting period finished
> 10/30/2012 14:57:35| main|compute-1-7|W|job 12730.1 exceeded hard wallclock time - initiate terminate method
> 10/30/2012 14:57:36| main|compute-1-7|I|SIGNAL jid: 12730 jatask: 1 signal: KILL
> 
> On 10/30/2012 03:00 PM, Reuti wrote:
>> Sorry, should be like:
>> 
>> 10/30/2012 22:59:50| main|pc15370|W|job 5281.1 exceeded hard wallclock time - initiate terminate method
>> 
>> On 30.10.2012, at 22:57, Joseph Farran wrote:
>> 
>>> Did not have loglevel set to log_info, so I updated it, restarted GE on the master, and did a softstop and start on the compute node.
>>> 
>>> I got a lot more log information now, but still no cigar:
>>> 
>>> # cat /var/spool/ge/compute-12-22/messages | fgrep h_rt
>>> #
>>> 
>>> Checked a few other compute nodes as well for "h_rt", and nothing there either.
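For reference, the successful kill above leaves two distinct execd log lines ("exceeded hard wallclock time" and the SIGNAL line), so a per-job check on a suspect node can grep for either; job id 12959 below is one of the overrunning jobs from later in the thread, and the path is the one used there:

```shell
# Sketch: did execd ever try to terminate job 12959 on this node?
# (path and job id taken from the thread; adjust per host)
grep -E 'exceeded hard wallclock|SIGNAL' /var/spool/ge/compute-12-22/messages | grep -w 12959
```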
>>> 
>>> On 10/30/2012 01:49 PM, Reuti wrote:
>>>> On 30.10.2012, at 20:18, Joseph Farran wrote:
>>>> 
>>>>> Here is one case:
>>>>> 
>>>>> qstat | egrep "12959|12960"
>>>>> 12959 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12 [email protected] 1
>>>>> 12960 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12 [email protected] 1
>>>>> 
>>>>> On compute-12-22:
>>>>> 
>>>>> compute-12-22 ~]# ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>> 
>>>>>   0 570   0 201 Sl /data/hpc/ge/bin/lx-amd64/sge_execd
>>>>>   0   0   0   0 S   \_ /bin/bash /data/hpc/ge/load-sensor-cores-in-use.sh
>>>>>   0 570   0 201 S   \_ sge_shepherd-12959 -bg
>>>>> 993 993 115 115 Ss  |   \_ -bash /var/spool/ge/compute-12-22/job_scripts/12959
>>>>> 993 993 115 115 Rs  |       \_ ./pcharmm32
>>>>>   0 570   0 201 S   \_ sge_shepherd-12960 -bg
>>>>> 993 993 115 115 Ss      \_ -bash /var/spool/ge/compute-12-22/job_scripts/12960
>>>>> 993 993 115 115 Rs          \_ ./pcharmm32
>>>>> 
>>>> Good, then: do you see any remark about the h_rt being exceeded in the messages file of the host, $SGE_ROOT/default/spool/compute-12-22/messages?
>>>> 
>>>> I.e., is
>>>> 
>>>> $ qconf -sconf
>>>> ...
>>>> loglevel    log_info
>>>> 
>>>> set?
>>>> 
>>>> -- Reuti
>>>> 
>>>>> On 10/30/2012 12:07 PM, Reuti wrote:
>>>>>> On 30.10.2012, at 20:02, Joseph Farran wrote:
>>>>>> 
>>>>>>> Hi Reuti.
>>>>>>> 
>>>>>>> Yes, I had that already set:
>>>>>>> 
>>>>>>> qconf -sconf | fgrep execd_params
>>>>>>> execd_params    ENABLE_ADDGRP_KILL=TRUE
>>>>>>> 
>>>>>>> What is strange is that 1 out of 10 jobs or so do get killed just fine when they go past the hard wallclock limit.
>>>>>>> 
>>>>>>> However, the majority of the jobs are not being killed when they go past their wallclock limit.
>>>>>>> 
>>>>>>> How can I investigate this further?
>>>>>> ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>>> 
>>>>>> (f w/o -) and post the relevant lines of the application please.
>>>>>> 
>>>>>> -- Reuti
>>>>>> 
>>>>>>> On 10/30/2012 11:44 AM, Reuti wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> On 30.10.2012, at 19:31, Joseph Farran wrote:
>>>>>>>> 
>>>>>>>>> I googled this issue but did not see much help on the subject.
>>>>>>>>> 
>>>>>>>>> I have several queues with hard wall clock limits like this one:
>>>>>>>>> 
>>>>>>>>> # qconf -sq queue | grep h_rt
>>>>>>>>> h_rt    96:00:00
>>>>>>>>> 
>>>>>>>>> I am running Son of Grid Engine 8.1.2, and many jobs run past the hard wall clock limit and continue to run.
>>>>>>>>> 
>>>>>>>>> Looking at the GE qmaster logs, I see dozens and dozens of these entries:
>>>>>>>>> 
>>>>>>>>> 10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
>>>>>>>> Maybe they jumped out of the process tree (usually jobs are killed by `kill -9 -- -pgrp`). You can kill them by their additional group id, which is attached to all started processes even if they executed something like `setsid`:
>>>>>>>> 
>>>>>>>> $ qconf -sconf
>>>>>>>> ...
>>>>>>>> execd_params    ENABLE_ADDGRP_KILL=TRUE
>>>>>>>> 
>>>>>>>> If it's still not working, we have to investigate the process tree.
>>>>>>>> 
>>>>>>>> HTH - Reuti
>>>>>>>> 
>>>>>>>>> These entries correspond to the running jobs that should have ended 96 hours ago, but they keep on running.
>>>>>>>>> 
>>>>>>>>> Why is GE not killing these jobs correctly when they run past the 96-hour limit, yet complains that they should have ended?

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
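To make Reuti's "jumped out of the process tree" remark concrete: a process that calls setsid(2) moves into a fresh session and process group, so a plain `kill -9 -- -pgrp` aimed at the job's original group no longer reaches it. The additional group id set via ENABLE_ADDGRP_KILL survives exactly this trick, because supplementary group ids are inherited and an unprivileged process cannot drop them. A minimal, SGE-free sketch of the escape (`sleep 300` stands in for a job process):

```shell
#!/bin/sh
# Minimal illustration (not SGE-specific): a child started via setsid
# ends up in its own process group, outside the script's group, so a
# group-wide `kill -- -PGID` from the starter would miss it.
setsid sleep 300 &
sleep 1                                   # give setsid time to exec sleep

child=$(pgrep -n -x sleep)                # newest process named exactly "sleep"
my_pgid=$(ps -o pgid= -p $$ | tr -d ' ')
child_pgid=$(ps -o pgid= -p "$child" | tr -d ' ')

echo "script pgid=$my_pgid, escaped child pgid=$child_pgid"
kill "$child"                             # clean up the stray sleep
```

On a node with procps, `ps -e -o pid,supgid,comm` (the `supgid` column is procps-specific) lists each process's supplementary group ids, including the extra gid SGE tags job processes with — which is how execd can still find and kill such escapees.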
