Sorry, the entry should look like this:

10/30/2012 22:59:50| main|pc15370|W|job 5281.1 exceeded hard wallclock time - initiate terminate method
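
Note that this entry does not contain the string "h_rt", so an `fgrep h_rt` on the messages file will not find it. Below is a minimal sketch for searching on the message text instead. It assumes the messages file keeps the pipe-separated layout of the entry above (date|component|host|level|text); the function names wallclock_kills and overdue_jobs are my own and hypothetical, not part of Grid Engine.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Grid Engine.
# Assumed layout of a messages line: date time|component|host|level|text

# Print the id of every job for which the execd logged that it
# started the terminate method after the hard wallclock limit.
# $1: path to an execd messages file
wallclock_kills() {
    awk -F'|' '
        $5 ~ /exceeded hard wallclock time/ {
            # text looks like: "job 5281.1 exceeded hard wallclock time ..."
            split($5, words, " ")
            print words[2]
        }
    ' "$1"
}

# Print job id and overrun for every scheduler warning of the form
# "job 13179.1 should have finished since 42318s".
# $1: path to the qmaster messages file
overdue_jobs() {
    awk -F'|' '
        $5 ~ /should have finished since/ {
            split($5, words, " ")
            print words[2], words[7]
        }
    ' "$1"
}
```

For example, `wallclock_kills /var/spool/ge/compute-12-22/messages` on the compute node, and `overdue_jobs` on the qmaster messages file. A job that shows up in the second list but never in the first would suggest that the execd did not even try to terminate it.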
On 30.10.2012 at 22:57, Joseph Farran wrote:

> I did not have loglevel set to log_info, so I updated it, restarted GE on the
> master, and did a softstop and start on the compute node.
>
> I get a lot more log information now, but still no cigar:
>
> # cat /var/spool/ge/compute-12-22/messages | fgrep h_rt
> #
>
> I checked a few other compute nodes for "h_rt" as well and found nothing
> either.
>
> On 10/30/2012 01:49 PM, Reuti wrote:
>> On 30.10.2012 at 20:18, Joseph Farran wrote:
>>
>>> Here is one case:
>>>
>>> qstat| egrep "12959|12960"
>>> 12959 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12 [email protected] 1
>>> 12960 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12 [email protected] 1
>>>
>>> On compute-12-22:
>>>
>>> compute-12-22 ~]# ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>
>>>     0   570     0   201 Sl   /data/hpc/ge/bin/lx-amd64/sge_execd
>>>     0     0     0     0 S     \_ /bin/bash /data/hpc/ge/load-sensor-cores-in-use.sh
>>>     0   570     0   201 S     \_ sge_shepherd-12959 -bg
>>>   993   993   115   115 Ss    |   \_ -bash /var/spool/ge/compute-12-22/job_scripts/12959
>>>   993   993   115   115 Rs    |       \_ ./pcharmm32
>>>     0   570     0   201 S     \_ sge_shepherd-12960 -bg
>>>   993   993   115   115 Ss        \_ -bash /var/spool/ge/compute-12-22/job_scripts/12960
>>>   993   993   115   115 Rs            \_ ./pcharmm32
>>>
>> Good. Then: do you see any remark about the h_rt being exceeded in the
>> messages file of the host, $SGE_ROOT/default/spool/compute-12-22/messages?
>>
>> I.e., is
>>
>> $ qconf -sconf
>> ...
>> loglevel log_info
>>
>> set?
>>
>> -- Reuti
>>
>>
>>> On 10/30/2012 12:07 PM, Reuti wrote:
>>>> On 30.10.2012 at 20:02, Joseph Farran wrote:
>>>>
>>>>> Hi Reuti.
>>>>>
>>>>> Yes, I had that already set:
>>>>>
>>>>> qconf -sconf|fgrep execd_params
>>>>> execd_params ENABLE_ADDGRP_KILL=TRUE
>>>>>
>>>>> What is strange is that about 1 out of 10 jobs does get killed just fine
>>>>> when it goes past the hard wall clock limit.
>>>>>
>>>>> However, the majority of the jobs are not being killed when they go past
>>>>> their wall clock limit.
>>>>>
>>>>> How can I investigate this further?
>>>> ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>
>>>> (f w/o -) and post the relevant lines of the application please.
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> On 10/30/2012 11:44 AM, Reuti wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 30.10.2012 at 19:31, Joseph Farran wrote:
>>>>>>
>>>>>>> I googled this issue but did not find much help on the subject.
>>>>>>>
>>>>>>> I have several queues with hard wall clock limits like this one:
>>>>>>>
>>>>>>> # qconf -sq queue | grep h_rt
>>>>>>> h_rt 96:00:00
>>>>>>>
>>>>>>> I am running Son of Grid Engine 8.1.2, and many jobs run past the hard
>>>>>>> wall clock limit and continue to run.
>>>>>>>
>>>>>>> Looking at the GE qmaster logs, I see dozens and dozens of these entries:
>>>>>>>
>>>>>>> 10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
>>>>>> Maybe they jumped out of the process tree (usually jobs are killed by
>>>>>> `kill -9 -- -pgrp`). You can kill them by their additional group id,
>>>>>> which is attached to all started processes even if they executed
>>>>>> something like `setsid`:
>>>>>>
>>>>>> $ qconf -sconf
>>>>>> ...
>>>>>> execd_params ENABLE_ADDGRP_KILL=TRUE
>>>>>>
>>>>>> If it's still not working, we have to investigate the process tree.
>>>>>>
>>>>>> HTH - Reuti
>>>>>>
>>>>>>
>>>>>>> These entries correspond to the running jobs that should have ended 96
>>>>>>> hours ago, but they keep on running.
>>>>>>>
>>>>>>> Why is GE not killing these jobs when they run past the 96 hour limit,
>>>>>>> even though it complains that they should have ended?
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> [email protected]
>>>>>>> https://gridengine.org/mailman/listinfo/users
