On 30.10.2012 at 20:18, Joseph Farran wrote:

> Here is one case:
>
> qstat| egrep "12959|12960"
>   12959 0.50500 dna.pmf_17 amentes      r     10/24/2012 18:59:12 [email protected] 1
>   12960 0.50500 dna.pmf_17 amentes      r     10/24/2012 18:59:12 [email protected] 1
>
> On compute-12-22:
>
> compute-12-22 ~]# ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>
>     0   570     0   201 Sl   /data/hpc/ge/bin/lx-amd64/sge_execd
>     0     0     0     0 S     \_ /bin/bash /data/hpc/ge/load-sensor-cores-in-use.sh
>     0   570     0   201 S     \_ sge_shepherd-12959 -bg
>   993   993   115   115 Ss    |   \_ -bash /var/spool/ge/compute-12-22/job_scripts/12959
>   993   993   115   115 Rs    |       \_ ./pcharmm32
>     0   570     0   201 S     \_ sge_shepherd-12960 -bg
>   993   993   115   115 Ss        \_ -bash /var/spool/ge/compute-12-22/job_scripts/12960
>   993   993   115   115 Rs            \_ ./pcharmm32
>
Good, then: do you see any remark about the h_rt being exceeded in the messages file of the host?

$SGE_ROOT/default/spool/compute-12-22/messages

I.e.:

$ qconf -sconf
...
loglevel                     log_info

is set?

-- Reuti


> On 10/30/2012 12:07 PM, Reuti wrote:
>> On 30.10.2012 at 20:02, Joseph Farran wrote:
>>
>>> Hi Reuti.
>>>
>>> Yes, I had that already set:
>>>
>>> qconf -sconf|fgrep execd_params
>>> execd_params                 ENABLE_ADDGRP_KILL=TRUE
>>>
>>> What is strange is that 1 out of 10 jobs or so do get killed just fine when
>>> they go past the hard wall clock limit.
>>>
>>> However, the majority of the jobs are not being killed when they go past
>>> their wall clock limit.
>>>
>>> How can I investigate this further?
>> ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>
>> (f w/o -) and post the relevant lines of the application please.
>>
>> -- Reuti
>>
>>
>>>
>>> On 10/30/2012 11:44 AM, Reuti wrote:
>>>> Hi,
>>>>
>>>> On 30.10.2012 at 19:31, Joseph Farran wrote:
>>>>
>>>>> I googled this issue but did not find much help on the subject.
>>>>>
>>>>> I have several queues with hard wall clock limits like this one:
>>>>>
>>>>> # qconf -sq queue | grep h_rt
>>>>> h_rt                  96:00:00
>>>>>
>>>>> I am running Son of Grid Engine 8.1.2 and many jobs run past the hard
>>>>> wall clock limit and continue to run.
>>>>>
>>>>> Looking at the GE qmaster logs, I see dozens and dozens of these entries:
>>>>>
>>>>> 10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
>>>> Maybe they jumped out of the process tree (usually jobs are killed by
>>>> `kill -9 -- -pgrp`). You can kill them by their additional group id, which
>>>> is attached to all started processes even if they executed something like
>>>> `setsid`:
>>>>
>>>> $ qconf -sconf
>>>> ...
>>>> execd_params                 ENABLE_ADDGRP_KILL=TRUE
>>>>
>>>> If it's still not working, we have to investigate the process tree.
>>>>
>>>> HTH - Reuti
>>>>
>>>>
>>>>> These entries correspond to the running jobs that should have ended 96
>>>>> hours ago, but they keep on running.
>>>>>
>>>>> Why is GE not killing these jobs correctly when they run past the 96 hour
>>>>> limit but yet complains they should have ended?
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> [email protected]
>>>>> https://gridengine.org/mailman/listinfo/users
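
The check of the execd messages file asked about in this thread could be scripted roughly as follows. This is only a sketch under assumptions: the function name `find_h_rt_remarks` is hypothetical, and the search pattern is a guess, since the exact wording execd uses when h_rt is exceeded is not quoted anywhere in this thread. It simply prints lines that mention h_rt or the given job id.

```shell
# Hypothetical helper (not part of Grid Engine): print the lines of an
# execd messages file that mention h_rt or a given job id.
# The pattern is an assumption, not the exact wording execd logs.
find_h_rt_remarks() {
    messages_file=$1   # e.g. $SGE_ROOT/default/spool/compute-12-22/messages
    job_id=$2          # e.g. 12959

    if [ ! -r "$messages_file" ]; then
        echo "cannot read $messages_file" >&2
        return 2
    fi

    # -i: ignore case, -E: extended regex.
    # "job <id>" must be followed by a non-digit or end of line,
    # so that job 12959 does not also match job 129590.
    grep -iE "h_rt|job ${job_id}([^0-9]|\$)" "$messages_file"
}
```

It could be called as `find_h_rt_remarks "$SGE_ROOT/default/spool/compute-12-22/messages" 12959` on the execution host. No output (exit status 1 from grep) would suggest either that execd never noticed the limit or that the loglevel is too low for such remarks to be written, which is why the `loglevel log_info` setting is worth confirming first.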
