On 30.10.2012 at 23:45, Joseph Farran wrote:

> No:
>
> # qconf -sq free2 | fgrep terminate
> terminate_method      NONE
Is the process still doing something sensible, or is it hanging somewhere in a loop?

$ strace -p 1234

where 1234 is the PID of the process on the node (you have to be root or the owner of the process). Afterwards: is a `kill -9 1234` by hand successful?

-- Reuti

> On 10/30/2012 03:07 PM, Reuti wrote:
>> Mmh, was the terminate method redefined in the queue configuration of the
>> queue in question?
>>
>>
>> On 30.10.2012 at 23:04, Joseph Farran wrote:
>>
>>> No, still no cigar.
>>>
>>> # cat /var/spool/ge/compute-12-22/messages | grep wall
>>> #
>>>
>>> Here is what is strange.
>>>
>>> Some jobs do get killed just fine. One job that just went over the time
>>> limit on another queue, GE killed it, and here is the log:
>>>
>>> 10/30/2012 14:32:06|  main|compute-1-7|I|registered at qmaster host "hpc.local"
>>> 10/30/2012 14:32:06|  main|compute-1-7|I|Reconnected to qmaster - enabled delayed job reporting period
>>> 10/30/2012 14:42:04|  main|compute-1-7|I|Delayed job reporting period finished
>>> 10/30/2012 14:57:35|  main|compute-1-7|W|job 12730.1 exceeded hard wallclock time - initiate terminate method
>>> 10/30/2012 14:57:36|  main|compute-1-7|I|SIGNAL jid: 12730 jatask: 1 signal: KILL
>>>
>>>
>>> On 10/30/2012 03:00 PM, Reuti wrote:
>>>> Sorry, should be like:
>>>>
>>>> 10/30/2012 22:59:50|  main|pc15370|W|job 5281.1 exceeded hard wallclock time - initiate terminate method
>>>>
>>>>
>>>> On 30.10.2012 at 22:57, Joseph Farran wrote:
>>>>
>>>>> Did not have loglevel set to log_info, so I updated it, restarted GE on
>>>>> the master, and did a softstop and start on the compute node.
>>>>>
>>>>> I got a lot more log information now, but still no cigar:
>>>>>
>>>>> # cat /var/spool/ge/compute-12-22/messages | fgrep h_rt
>>>>> #
>>>>>
>>>>> Checked a few other compute nodes as well for "h_rt" and found nothing
>>>>> either.
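[Editor's note: Reuti's manual test above (attach strace, then try a `kill -9` by hand) can be scripted. A minimal Python sketch, not part of Grid Engine, that checks whether a PID still exists after a SIGKILL, using the standard signal-0 probe:]

```python
import os
import signal
import subprocess

def pid_alive(pid):
    """Probe a PID with signal 0: nothing is delivered, but the kernel
    still checks that the process exists and that we may signal it."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True   # process exists but belongs to another user

# Demo with a throwaway child instead of a real stray job:
child = subprocess.Popen(["sleep", "300"])
assert pid_alive(child.pid)

os.kill(child.pid, signal.SIGKILL)   # the manual `kill -9 1234`
child.wait()                         # reap it, so the PID really disappears
print("still alive:", pid_alive(child.pid))
```

[If `pid_alive` keeps returning True even after a SIGKILL, the process is most likely stuck in uninterruptible sleep (state D in `ps`), which not even root can kill; that would point at a kernel/IO problem rather than at Grid Engine.]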
>>>>>
>>>>> On 10/30/2012 01:49 PM, Reuti wrote:
>>>>>> On 30.10.2012 at 20:18, Joseph Farran wrote:
>>>>>>
>>>>>>> Here is one case:
>>>>>>>
>>>>>>> qstat | egrep "12959|12960"
>>>>>>> 12959 0.50500 dna.pmf_17 amentes  r  10/24/2012 18:59:12  [email protected]  1
>>>>>>> 12960 0.50500 dna.pmf_17 amentes  r  10/24/2012 18:59:12  [email protected]  1
>>>>>>>
>>>>>>> On compute-12-22:
>>>>>>>
>>>>>>> compute-12-22 ~]# ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>>>>   0 570   0 201 Sl  /data/hpc/ge/bin/lx-amd64/sge_execd
>>>>>>>   0   0   0   0 S    \_ /bin/bash /data/hpc/ge/load-sensor-cores-in-use.sh
>>>>>>>   0 570   0 201 S    \_ sge_shepherd-12959 -bg
>>>>>>> 993 993 115 115 Ss  |   \_ -bash /var/spool/ge/compute-12-22/job_scripts/12959
>>>>>>> 993 993 115 115 Rs  |       \_ ./pcharmm32
>>>>>>>   0 570   0 201 S    \_ sge_shepherd-12960 -bg
>>>>>>> 993 993 115 115 Ss      \_ -bash /var/spool/ge/compute-12-22/job_scripts/12960
>>>>>>> 993 993 115 115 Rs          \_ ./pcharmm32
>>>>>>>
>>>>>> Good. Then: do you see any remark about the h_rt being exceeded in the
>>>>>> messages file of the host, $SGE_ROOT/default/spool/compute-12-22/messages?
>>>>>>
>>>>>> I.e., is:
>>>>>>
>>>>>> $ qconf -sconf
>>>>>> ...
>>>>>> loglevel                     log_info
>>>>>>
>>>>>> set?
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>> On 10/30/2012 12:07 PM, Reuti wrote:
>>>>>>>> On 30.10.2012 at 20:02, Joseph Farran wrote:
>>>>>>>>
>>>>>>>>> Hi Reuti.
>>>>>>>>>
>>>>>>>>> Yes, I had that already set:
>>>>>>>>>
>>>>>>>>> qconf -sconf | fgrep execd_params
>>>>>>>>> execd_params                 ENABLE_ADDGRP_KILL=TRUE
>>>>>>>>>
>>>>>>>>> What is strange is that about 1 out of 10 jobs does get killed just
>>>>>>>>> fine when it goes past the hard wall clock limit.
>>>>>>>>>
>>>>>>>>> However, the majority of the jobs are not being killed when they go
>>>>>>>>> past their wall clock limit.
>>>>>>>>>
>>>>>>>>> How can I investigate this further?
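[Editor's note: ENABLE_ADDGRP_KILL works because sge_execd tags every process of a job with an extra supplementary group ID, and that tag can be inspected directly on a suspect process. A small Linux-only sketch (my own helper, not an SGE tool) that reads the `Groups:` line from /proc/&lt;pid&gt;/status; for a stray job you would look for the additional GID the shepherd assigned:]

```python
import os

def supplementary_groups(pid):
    """Return the supplementary group IDs of a process by parsing the
    Groups: line of /proc/<pid>/status (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("Groups:"):
                return sorted(int(g) for g in line.split()[1:])
    return []

# Here we just inspect ourselves; on a compute node you would pass the
# PID of the runaway ./pcharmm32 process instead.
print(supplementary_groups(os.getpid()))
```

[If the runaway process no longer carries the job's additional GID at all, ENABLE_ADDGRP_KILL cannot find it either, which would explain why only some jobs get cleaned up.]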
>>>>>>>> ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>>>>>
>>>>>>>> (f without -) and post the relevant lines of the application, please.
>>>>>>>>
>>>>>>>> -- Reuti
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 10/30/2012 11:44 AM, Reuti wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On 30.10.2012 at 19:31, Joseph Farran wrote:
>>>>>>>>>>
>>>>>>>>>>> I googled this issue but did not find much help on the subject.
>>>>>>>>>>>
>>>>>>>>>>> I have several queues with hard wall clock limits like this one:
>>>>>>>>>>>
>>>>>>>>>>> # qconf -sq queue | grep h_rt
>>>>>>>>>>> h_rt                  96:00:00
>>>>>>>>>>>
>>>>>>>>>>> I am running Son of Grid Engine 8.1.2, and many jobs run past the
>>>>>>>>>>> hard wall clock limit and continue to run.
>>>>>>>>>>>
>>>>>>>>>>> Looking at the GE qmaster logs, I see dozens and dozens of these entries:
>>>>>>>>>>>
>>>>>>>>>>> 10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
>>>>>>>>>>>
>>>>>>>>>> Maybe they jumped out of the process tree (usually jobs are killed
>>>>>>>>>> by `kill -9 -- -pgrp`). You can kill them by their additional group
>>>>>>>>>> id, which is attached to all started processes even if they executed
>>>>>>>>>> something like `setsid`:
>>>>>>>>>>
>>>>>>>>>> $ qconf -sconf
>>>>>>>>>> ...
>>>>>>>>>> execd_params                 ENABLE_ADDGRP_KILL=TRUE
>>>>>>>>>>
>>>>>>>>>> If it's still not working, we have to investigate the process tree.
>>>>>>>>>>
>>>>>>>>>> HTH - Reuti
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> These entries correspond to the running jobs that should have ended
>>>>>>>>>>> 96 hours ago, but they keep on running.
>>>>>>>>>>>
>>>>>>>>>>> Why is GE not killing these jobs correctly when they run past the
>>>>>>>>>>> 96-hour limit, yet complains that they should have ended?
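[Editor's note: Reuti's "jumped out of the process tree" theory is easy to reproduce: a child that calls setsid(2) leaves the job's process group, so the shepherd's usual `kill -9 -- -pgrp` no longer reaches it. A self-contained sketch using plain POSIX process handling (not SGE code); the "job script" here is a stand-in for a real batch job:]

```python
import os
import signal
import subprocess
import time

# Fake job script: one well-behaved child and one that detaches itself
# into a new session, exactly what a daemonizing tool does.
job = subprocess.Popen(
    ["sh", "-c", "sleep 300 & setsid sleep 300 & wait"],
    start_new_session=True,   # shepherd-style: the job gets its own pgrp
)
time.sleep(1)                 # let both children start

pgid = os.getpgid(job.pid)
os.killpg(pgid, signal.SIGKILL)   # the usual `kill -9 -- -pgrp`
job.wait()
time.sleep(1)

# The setsid child moved to a different process group and survived:
out = subprocess.run(["pgrep", "-fx", "sleep 300"],
                     capture_output=True, text=True)
survivors = [int(p) for p in out.stdout.split()]
print("escaped processes:", len(survivors))

for pid in survivors:          # clean up the escapee
    os.kill(pid, signal.SIGKILL)
```

[ENABLE_ADDGRP_KILL closes exactly this hole: the supplementary GID survives a setsid() call, while membership in the original process group does not.]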
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
