On 31.10.2012 at 00:03, Joseph Farran wrote:

> The strace shows the job running ok: doing work and then writing to a file.
>
> I was able to kill the jobs ( 1-core each ) just fine with "kill -9".
>
> Looking at the qmaster log after a few minutes, it said:
>
> 10/30/2012 15:58:41|worker|hpc|I|removing trigger to terminate job 12960.1
> 10/30/2012 15:58:41|worker|hpc|I|job 12960.1 finished on host compute-12-22.local
> 10/30/2012 15:58:41|worker|hpc|I|removing trigger to terminate job 12959.1
> 10/30/2012 15:58:41|worker|hpc|I|job 12959.1 finished on host compute-12-22.local

Did you define s_rt and -notify too?

-- Reuti

> So GE cleared out the jobs ok. Not sure why SGE on the node is not killing
> them correctly.
>
> Oh well, thanks Reuti. I will keep playing with this...
>
>
> On 10/30/2012 03:53 PM, Reuti wrote:
>> On 30.10.2012 at 23:45, Joseph Farran wrote:
>>
>>> No:
>>>
>>> # qconf -sq free2 | fgrep terminate
>>> terminate_method NONE
>> Is the process still doing something serious or hanging somewhere in a loop:
>>
>> $ strace -p 1234
>>
>> and 1234 is the pid of the process on the node (you have to be root or owner
>> of the process).
>>
>> Afterwards: is a kill -9 1234 by hand successful?
>>
>> -- Reuti
>>
>>
>>> On 10/30/2012 03:07 PM, Reuti wrote:
>>>> Mmh, was the terminate method redefined in the queue configuration of the
>>>> queue in question?
>>>>
>>>>
>>>> On 30.10.2012 at 23:04, Joseph Farran wrote:
>>>>
>>>>> No, still no cigar.
>>>>>
>>>>> # cat /var/spool/ge/compute-12-22/messages | grep wall
>>>>> #
>>>>>
>>>>> Here is what is strange.
>>>>>
>>>>> Some jobs do get killed just fine.
>>>>> One job just went over the time limit on another queue, GE killed it,
>>>>> and here is the log:
>>>>>
>>>>> 10/30/2012 14:32:06| main|compute-1-7|I|registered at qmaster host "hpc.local"
>>>>> 10/30/2012 14:32:06| main|compute-1-7|I|Reconnected to qmaster - enabled delayed job reporting period
>>>>> 10/30/2012 14:42:04| main|compute-1-7|I|Delayed job reporting period finished
>>>>> 10/30/2012 14:57:35| main|compute-1-7|W|job 12730.1 exceeded hard wallclock time - initiate terminate method
>>>>> 10/30/2012 14:57:36| main|compute-1-7|I|SIGNAL jid: 12730 jatask: 1 signal: KILL
>>>>>
>>>>>
>>>>> On 10/30/2012 03:00 PM, Reuti wrote:
>>>>>> Sorry, should be like:
>>>>>>
>>>>>> 10/30/2012 22:59:50| main|pc15370|W|job 5281.1 exceeded hard wallclock time - initiate terminate method
>>>>>>
>>>>>>
>>>>>> On 30.10.2012 at 22:57, Joseph Farran wrote:
>>>>>>
>>>>>>> I did not have loglevel set to log_info, so I updated it, restarted GE
>>>>>>> on the master, and did a softstop and start on the compute node.
>>>>>>>
>>>>>>> I get a lot more log information now, but still no cigar:
>>>>>>>
>>>>>>> # cat /var/spool/ge/compute-12-22/messages | fgrep h_rt
>>>>>>> #
>>>>>>>
>>>>>>> I checked a few other compute nodes as well for "h_rt" and found
>>>>>>> nothing either.
>>>>>>>
>>>>>>>
>>>>>>> On 10/30/2012 01:49 PM, Reuti wrote:
>>>>>>>> On 30.10.2012 at 20:18, Joseph Farran wrote:
>>>>>>>>
>>>>>>>>> Here is one case:
>>>>>>>>>
>>>>>>>>> qstat| egrep "12959|12960"
>>>>>>>>> 12959 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12 [email protected] 1
>>>>>>>>> 12960 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12 [email protected] 1
>>>>>>>>>
>>>>>>>>> On compute-12-22:
>>>>>>>>>
>>>>>>>>> compute-12-22 ~]# ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>>>>>>
>>>>>>>>>     0   570     0   201 Sl   /data/hpc/ge/bin/lx-amd64/sge_execd
>>>>>>>>>     0     0     0     0 S     \_ /bin/bash /data/hpc/ge/load-sensor-cores-in-use.sh
>>>>>>>>>     0   570     0   201 S     \_ sge_shepherd-12959 -bg
>>>>>>>>>   993   993   115   115 Ss    |   \_ -bash /var/spool/ge/compute-12-22/job_scripts/12959
>>>>>>>>>   993   993   115   115 Rs    |       \_ ./pcharmm32
>>>>>>>>>     0   570     0   201 S     \_ sge_shepherd-12960 -bg
>>>>>>>>>   993   993   115   115 Ss        \_ -bash /var/spool/ge/compute-12-22/job_scripts/12960
>>>>>>>>>   993   993   115   115 Rs            \_ ./pcharmm32
>>>>>>>>>
>>>>>>>> Good, then: do you see any remark about the h_rt being exceeded in the
>>>>>>>> messages file of the host?
>>>>>>>>
>>>>>>>> $SGE_ROOT/default/spool/compute-12-22/messages
>>>>>>>>
>>>>>>>> I.e.:
>>>>>>>>
>>>>>>>> $ qconf -sconf
>>>>>>>> ...
>>>>>>>> loglevel log_info
>>>>>>>>
>>>>>>>> is set?
>>>>>>>>
>>>>>>>> -- Reuti
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 10/30/2012 12:07 PM, Reuti wrote:
>>>>>>>>>> On 30.10.2012 at 20:02, Joseph Farran wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Reuti.
>>>>>>>>>>>
>>>>>>>>>>> Yes, I had that already set:
>>>>>>>>>>>
>>>>>>>>>>> qconf -sconf|fgrep execd_params
>>>>>>>>>>> execd_params ENABLE_ADDGRP_KILL=TRUE
>>>>>>>>>>>
>>>>>>>>>>> What is strange is that 1 out of 10 jobs or so does get killed just
>>>>>>>>>>> fine when it goes past the hard wall clock limit.
>>>>>>>>>>>
>>>>>>>>>>> However, the majority of the jobs are not being killed when they go
>>>>>>>>>>> past their wall clock limit.
>>>>>>>>>>>
>>>>>>>>>>> How can I investigate this further?
>>>>>>>>>> ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>>>>>>>
>>>>>>>>>> (f w/o -) and post the relevant lines of the application please.
>>>>>>>>>>
>>>>>>>>>> -- Reuti
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 10/30/2012 11:44 AM, Reuti wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> On 30.10.2012 at 19:31, Joseph Farran wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I googled this issue but did not find much help on the subject.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have several queues with hard wall clock limits like this one:
>>>>>>>>>>>>>
>>>>>>>>>>>>> # qconf -sq queue | grep h_rt
>>>>>>>>>>>>> h_rt 96:00:00
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am running Son of Grid Engine 8.1.2 and many jobs run past the
>>>>>>>>>>>>> hard wall clock limit and continue to run.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looking at the GE qmaster logs, I see dozens and dozens of these
>>>>>>>>>>>>> entries:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
>>>>>>>>>>>> Maybe they jumped out of the process tree (usually jobs are killed
>>>>>>>>>>>> by `kill -9 -- -pgrp`). You can kill them by their additional group
>>>>>>>>>>>> id, which is attached to all started processes even if they
>>>>>>>>>>>> executed something like `setsid`:
>>>>>>>>>>>>
>>>>>>>>>>>> $ qconf -sconf
>>>>>>>>>>>> ...
>>>>>>>>>>>> execd_params ENABLE_ADDGRP_KILL=TRUE
>>>>>>>>>>>>
>>>>>>>>>>>> If it's still not working, we have to investigate the process tree.
>>>>>>>>>>>>
>>>>>>>>>>>> HTH - Reuti
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> These entries correspond to running jobs that should have ended
>>>>>>>>>>>>> at the 96-hour limit, but they keep on running.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Why is GE not killing these jobs correctly when they run past the
>>>>>>>>>>>>> 96 hour limit but yet complains they should have ended?
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> users mailing list
>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>> https://gridengine.org/mailman/listinfo/users
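
For anyone following up on this thread: the scheduler warnings quoted above ("job 13179.1 should have finished since 42318s") can be turned into a short list of overdue jobs, which makes it easier to pick a node and inspect its process tree as Reuti suggests. The following is only a rough sketch, not part of Grid Engine. The function name overdue_jobs is made up for illustration, and it assumes the message text is the fifth '|'-separated field, as in the log lines quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Grid Engine): list jobs that the
# scheduler reports as overdue, based on qmaster log lines such as
#   10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
# Assumption: the message text is the 5th '|'-separated field.
overdue_jobs() {
    awk -F'|' '
        $5 ~ /^job [0-9]+\.[0-9]+ should have finished since [0-9]+s/ {
            split($5, w, " ")        # w[2] = job.task, w[7] = e.g. "42318s"
            secs = w[7]
            sub(/s$/, "", secs)      # strip the trailing unit
            last[w[2]] = secs        # keep the most recent report per job
        }
        END { for (j in last) printf "%s %d\n", j, last[j] }
    ' "$@" | sort -t. -k1,1n         # order by job number
}
```

Call it with the path to the qmaster messages file as argument (the location depends on the installation), or pipe log lines into it. Each output line holds a job.task id and the number of seconds it is overdue according to the latest log entry.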
