On 31.10.2012 at 00:30, Joseph Farran wrote:

> Looking at one of the other running jobs (that should have ended by now), I
> don't see the notify:
>
> # cat /var/spool/ge/qmaster/job_scripts/12923 | fgrep notify
>
> # qstat | grep 12923
> 12923 0.50500 dna.pmf_15 amentes      r     10/24/2012 18:59:08 [email protected]     1
It can be requested on the command line or in any of the sge_request files. Is it in:

$ qstat -j 12923

-- Reuti

> On 10/30/2012 04:18 PM, Reuti wrote:
>> On 31.10.2012 at 00:13, Joseph Farran wrote:
>>
>>> At first I only had the hard wall clock "h_rt", but a while ago I also
>>> added the soft one:
>>>
>>> Here are all of the related fields:
>>>
>>> # qconf -sq free2 | egrep "rt|notify|terminate"
>>> shell_start_mode      posix_compliant
>>> starter_method        NONE
>>> terminate_method      NONE
>>> notify                00:00:60
>>> s_rt                  96:00:00
>>> h_rt                  96:00:00
>>>
>>> Notify is set to 60, but I don't know what this does.
>> Were they also submitted with -notify? There was (is) an issue when both
>> the warning by s_rt and -notify are requested: the warnings are sent to
>> the job every 90 seconds, but it never gets killed.
>>
>> -- Reuti
>>
>>> On 10/30/2012 04:06 PM, Reuti wrote:
>>>> On 31.10.2012 at 00:03, Joseph Farran wrote:
>>>>
>>>>> The strace shows the job running ok: doing work and then writing to a file.
>>>>>
>>>>> I was able to kill the jobs (1 core each) just fine with "kill -9".
>>>>>
>>>>> A few minutes later the qmaster log said:
>>>>>
>>>>> 10/30/2012 15:58:41|worker|hpc|I|removing trigger to terminate job 12960.1
>>>>> 10/30/2012 15:58:41|worker|hpc|I|job 12960.1 finished on host compute-12-22.local
>>>>> 10/30/2012 15:58:41|worker|hpc|I|removing trigger to terminate job 12959.1
>>>>> 10/30/2012 15:58:41|worker|hpc|I|job 12959.1 finished on host compute-12-22.local
>>>> Did you define s_rt and -notify too?
>>>>
>>>> -- Reuti
>>>>
>>>>> So GE cleared out the jobs ok. Not sure why sge on the node is not
>>>>> killing them correctly.
>>>>>
>>>>> Oh well, thanks Reuti. I will keep playing with this...
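For reference, the `notify 00:00:60` value in the queue configuration above is the lead time for the warning signals: when a job is submitted with `qsub -notify`, sge_execd sends SIGUSR1 that long before a SIGSTOP and SIGUSR2 that long before the final SIGKILL. A minimal sketch of a job script that makes use of the warning (the `sleep` is a stand-in for the real work, e.g. ./pcharmm32; the checkpoint message is only illustrative):

```shell
#!/bin/sh
# Sketch of a job script for `qsub -notify`: sge_execd delivers SIGUSR2
# `notify` seconds (00:00:60 above) before the SIGKILL, giving the job
# a window to checkpoint and exit cleanly.
trap 'echo "SIGUSR2 received - shutting down cleanly"; exit 0' USR2
sleep 600 &            # stand-in for the real work
wait $!                # wait is interruptible, so the trap can fire
```

This is also why the s_rt/-notify interaction Reuti mentions matters: if the soft-limit warning path misbehaves, the job keeps receiving warnings without the terminal signal ever arriving.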
>>>>>
>>>>> On 10/30/2012 03:53 PM, Reuti wrote:
>>>>>> On 30.10.2012 at 23:45, Joseph Farran wrote:
>>>>>>
>>>>>>> No:
>>>>>>>
>>>>>>> # qconf -sq free2 | fgrep terminate
>>>>>>> terminate_method      NONE
>>>>>> Is the process still doing something serious, or is it hanging somewhere
>>>>>> in a loop:
>>>>>>
>>>>>> $ strace -p 1234
>>>>>>
>>>>>> where 1234 is the pid of the process on the node (you have to be root or
>>>>>> the owner of the process).
>>>>>>
>>>>>> Afterwards: is a kill -9 1234 by hand successful?
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>> On 10/30/2012 03:07 PM, Reuti wrote:
>>>>>>>> Mmh, was the terminate method redefined in the configuration of the
>>>>>>>> queue in question?
>>>>>>>>
>>>>>>>> On 30.10.2012 at 23:04, Joseph Farran wrote:
>>>>>>>>
>>>>>>>>> No, still no cigar.
>>>>>>>>>
>>>>>>>>> # cat /var/spool/ge/compute-12-22/messages | grep wall
>>>>>>>>> #
>>>>>>>>>
>>>>>>>>> Here is what is strange.
>>>>>>>>>
>>>>>>>>> Some jobs do get killed just fine.
>>>>>>>>> One job on another queue that just went over the time limit was
>>>>>>>>> killed by GE, and here is the log:
>>>>>>>>>
>>>>>>>>> 10/30/2012 14:32:06| main|compute-1-7|I|registered at qmaster host "hpc.local"
>>>>>>>>> 10/30/2012 14:32:06| main|compute-1-7|I|Reconnected to qmaster - enabled delayed job reporting period
>>>>>>>>> 10/30/2012 14:42:04| main|compute-1-7|I|Delayed job reporting period finished
>>>>>>>>> 10/30/2012 14:57:35| main|compute-1-7|W|job 12730.1 exceeded hard wallclock time - initiate terminate method
>>>>>>>>> 10/30/2012 14:57:36| main|compute-1-7|I|SIGNAL jid: 12730 jatask: 1 signal: KILL
>>>>>>>>>
>>>>>>>>> On 10/30/2012 03:00 PM, Reuti wrote:
>>>>>>>>>> Sorry, it should be like:
>>>>>>>>>>
>>>>>>>>>> 10/30/2012 22:59:50| main|pc15370|W|job 5281.1 exceeded hard wallclock time - initiate terminate method
>>>>>>>>>>
>>>>>>>>>> On 30.10.2012 at 22:57, Joseph Farran wrote:
>>>>>>>>>>
>>>>>>>>>>> I did not have loglevel set to log_info, so I updated it, restarted
>>>>>>>>>>> GE on the master, and did a softstop and start on the compute node.
>>>>>>>>>>>
>>>>>>>>>>> I get a lot more log information now, but still no cigar:
>>>>>>>>>>>
>>>>>>>>>>> # cat /var/spool/ge/compute-12-22/messages | fgrep h_rt
>>>>>>>>>>> #
>>>>>>>>>>>
>>>>>>>>>>> I checked a few other compute nodes for the "h_rt" as well and
>>>>>>>>>>> found nothing either.
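Rather than grepping each node's messages file by hand, the whole spool can be swept in one pass for the wallclock warning Reuti quoted. A sketch assuming the spool layout from the commands above (/var/spool/ge/<node>/messages; adjust the path to your own $SGE_ROOT/<cell>/spool if it differs):

```shell
# Find every execd messages file that logged the hard-wallclock warning;
# -H prefixes each match with the file (and thus the node) it came from.
grep -H "exceeded hard wallclock time" /var/spool/ge/*/messages
```

Nodes that never emit this line for an overdue job are the ones where the terminate method is not being initiated at all.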
>>>>>>>>>>>
>>>>>>>>>>> On 10/30/2012 01:49 PM, Reuti wrote:
>>>>>>>>>>>> On 30.10.2012 at 20:18, Joseph Farran wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Here is one case:
>>>>>>>>>>>>>
>>>>>>>>>>>>> qstat | egrep "12959|12960"
>>>>>>>>>>>>> 12959 0.50500 dna.pmf_17 amentes      r     10/24/2012 18:59:12 [email protected]     1
>>>>>>>>>>>>> 12960 0.50500 dna.pmf_17 amentes      r     10/24/2012 18:59:12 [email protected]     1
>>>>>>>>>>>>>
>>>>>>>>>>>>> On compute-12-22:
>>>>>>>>>>>>>
>>>>>>>>>>>>> compute-12-22 ~]# ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>>>>>>>>>>   0 570   0 201 Sl /data/hpc/ge/bin/lx-amd64/sge_execd
>>>>>>>>>>>>>   0   0   0   0 S   \_ /bin/bash /data/hpc/ge/load-sensor-cores-in-use.sh
>>>>>>>>>>>>>   0 570   0 201 S   \_ sge_shepherd-12959 -bg
>>>>>>>>>>>>> 993 993 115 115 Ss  |   \_ -bash /var/spool/ge/compute-12-22/job_scripts/12959
>>>>>>>>>>>>> 993 993 115 115 Rs  |       \_ ./pcharmm32
>>>>>>>>>>>>>   0 570   0 201 S   \_ sge_shepherd-12960 -bg
>>>>>>>>>>>>> 993 993 115 115 Ss      \_ -bash /var/spool/ge/compute-12-22/job_scripts/12960
>>>>>>>>>>>>> 993 993 115 115 Rs          \_ ./pcharmm32
>>>>>>>>>>>>>
>>>>>>>>>>>> Good. Then: do you see any remark about h_rt being exceeded in the
>>>>>>>>>>>> messages file of the host,
>>>>>>>>>>>> $SGE_ROOT/default/spool/compute-12-22/messages?
>>>>>>>>>>>>
>>>>>>>>>>>> I.e., is
>>>>>>>>>>>>
>>>>>>>>>>>> $ qconf -sconf
>>>>>>>>>>>> ...
>>>>>>>>>>>> loglevel              log_info
>>>>>>>>>>>>
>>>>>>>>>>>> set?
>>>>>>>>>>>>
>>>>>>>>>>>> -- Reuti
>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/30/2012 12:07 PM, Reuti wrote:
>>>>>>>>>>>>>> On 30.10.2012 at 20:02, Joseph Farran wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Reuti.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, I had that already set:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> qconf -sconf | fgrep execd_params
>>>>>>>>>>>>>>> execd_params          ENABLE_ADDGRP_KILL=TRUE
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What is strange is that about 1 out of 10 jobs does get killed
>>>>>>>>>>>>>>> just fine when it goes past the hard wall clock time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, the majority of the jobs are not being killed when
>>>>>>>>>>>>>>> they go past their wall clock time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> How can I investigate this further?
>>>>>>>>>>>>>> ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (f without -) and please post the relevant lines of the application.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- Reuti
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 10/30/2012 11:44 AM, Reuti wrote:
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 30.10.2012 at 19:31, Joseph Farran wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I googled this issue but did not find much help on the subject.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have several queues with hard wall clock limits like this one:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> # qconf -sq queue | grep h_rt
>>>>>>>>>>>>>>>>> h_rt                  96:00:00
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am running Son of Grid Engine 8.1.2, and many jobs run past
>>>>>>>>>>>>>>>>> the hard wall clock limit and continue to run.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Looking at the GE qmaster logs, I see dozens and dozens of
>>>>>>>>>>>>>>>>> these entries:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
>>>>>>>>>>>>>>>> Maybe they jumped out of the process tree (usually jobs are
>>>>>>>>>>>>>>>> killed by `kill -9 -- -pgrp`).
>>>>>>>>>>>>>>>> You can kill them by their additional group id, which is
>>>>>>>>>>>>>>>> attached to all started processes even if they executed
>>>>>>>>>>>>>>>> something like `setsid`:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> $ qconf -sconf
>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>> execd_params          ENABLE_ADDGRP_KILL=TRUE
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If it's still not working, we have to investigate the process
>>>>>>>>>>>>>>>> tree.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> HTH - Reuti
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> These entries correspond to the running jobs that should have
>>>>>>>>>>>>>>>>> ended 96 hours ago, but they keep on running.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Why is GE not killing these jobs when they run past the
>>>>>>>>>>>>>>>>> 96 hour limit, yet complains that they should have ended?

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
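To illustrate the additional-group-id mechanism Reuti describes: a `setsid` call gives a process its own session and process group, so a plain `kill -9 -- -pgrp` misses it, but supplementary group membership survives `setsid`. Scanning /proc for the job's extra GID therefore still finds escaped processes. A rough sketch of the idea, not the actual sge_execd code; the GID 20001 is made up, and the kill is replaced by an echo so the sketch is safe to run:

```shell
#!/bin/sh
# Find every process whose supplementary groups (the Groups: line in
# /proc/<pid>/status) contain the job's additional GID, regardless of
# its session or process group.
ADDGID=20001
for p in /proc/[0-9]*; do
    if grep -q "^Groups:.*\b${ADDGID}\b" "$p/status" 2>/dev/null; then
        echo "would kill ${p#/proc/}"   # execd sends SIGKILL at this point
    fi
done
```

With ENABLE_ADDGRP_KILL=TRUE the execd performs essentially this sweep when the terminate method fires, which is why it catches processes that a process-group kill cannot.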
