On 31.10.2012 at 00:13, Joseph Farran wrote:

> At first, I only had the hard wall clock "h_rt", but a while ago I also added
> the soft one:
>
> Here are all of the related fields:
>
> # qconf -sq free2 | egrep "rt|notify|terminate"
> shell_start_mode      posix_compliant
> starter_method        NONE
> terminate_method      NONE
> notify                00:00:60
> s_rt                  96:00:00
> h_rt                  96:00:00
>
> Notify is set to 60, but I don't know what this does.
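For context: `notify` is the grace period between a warning signal and the real one. With `qsub -notify`, SIGUSR1 is delivered that many seconds before a suspend and SIGUSR2 before a kill; a job exceeding `s_rt` likewise receives SIGUSR1. A job script can trap the warning and shut down cleanly before the h_rt SIGKILL arrives. A minimal self-contained sketch (the handler, the variable names, and the self-delivered signal are illustrative only; no SGE is involved):

```shell
#!/bin/sh
# A job script that catches the warning signal SGE sends ahead of the
# real one (SIGUSR1 on s_rt or suspend, SIGUSR2 before a kill with
# qsub -notify). The warning is simulated locally with kill -USR1.

caught=0
on_warning() {
    caught=1                    # a real job would checkpoint/flush here
    kill "$worker" 2>/dev/null  # stop the work before SIGKILL arrives
}
trap on_warning USR1 USR2

sleep 30 &                      # stand-in for the actual computation
worker=$!

kill -USR1 $$                   # simulate the shepherd's warning
wait "$worker" 2>/dev/null
echo "warning caught: $caught"
```
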
Were they also submitted with -notify? There was (and still is) an issue when warnings from both s_rt and -notify are requested: the warning is sent to the job every 90 seconds, but the job never gets killed.

-- Reuti

> On 10/30/2012 04:06 PM, Reuti wrote:
>> On 31.10.2012 at 00:03, Joseph Farran wrote:
>>
>>> The strace shows the job running ok: doing work and then writing to a file.
>>>
>>> I was able to kill the jobs (1 core each) just fine with "kill -9".
>>>
>>> Looking at the qmaster log a few minutes later:
>>>
>>> 10/30/2012 15:58:41|worker|hpc|I|removing trigger to terminate job 12960.1
>>> 10/30/2012 15:58:41|worker|hpc|I|job 12960.1 finished on host compute-12-22.local
>>> 10/30/2012 15:58:41|worker|hpc|I|removing trigger to terminate job 12959.1
>>> 10/30/2012 15:58:41|worker|hpc|I|job 12959.1 finished on host compute-12-22.local
>> Did you define s_rt and -notify too?
>>
>> -- Reuti
>>
>>> So GE cleared out the jobs ok. Not sure why sge_execd on the node is not
>>> killing them correctly.
>>>
>>> Oh well, thanks Reuti. I will keep playing with this...
>>>
>>> On 10/30/2012 03:53 PM, Reuti wrote:
>>>> On 30.10.2012 at 23:45, Joseph Farran wrote:
>>>>
>>>>> No:
>>>>>
>>>>> # qconf -sq free2 | fgrep terminate
>>>>> terminate_method      NONE
>>>> Is the process still doing something serious, or hanging somewhere in a
>>>> loop?
>>>>
>>>> $ strace -p 1234
>>>>
>>>> where 1234 is the pid of the process on the node (you have to be root or
>>>> the owner of the process).
>>>>
>>>> Afterwards: is a kill -9 1234 by hand successful?
>>>>
>>>> -- Reuti
>>>>
>>>>> On 10/30/2012 03:07 PM, Reuti wrote:
>>>>>> Mmh, was the terminate method redefined in the queue configuration of
>>>>>> the queue in question?
>>>>>>
>>>>>> On 30.10.2012 at 23:04, Joseph Farran wrote:
>>>>>>
>>>>>>> No, still no cigar.
>>>>>>>
>>>>>>> # cat /var/spool/ge/compute-12-22/messages | grep wall
>>>>>>> #
>>>>>>>
>>>>>>> Here is what is strange.
>>>>>>>
>>>>>>> Some jobs do get killed just fine. One job just went over the time
>>>>>>> limit on another queue; GE killed it, and here is the log:
>>>>>>>
>>>>>>> 10/30/2012 14:32:06| main|compute-1-7|I|registered at qmaster host "hpc.local"
>>>>>>> 10/30/2012 14:32:06| main|compute-1-7|I|Reconnected to qmaster - enabled delayed job reporting period
>>>>>>> 10/30/2012 14:42:04| main|compute-1-7|I|Delayed job reporting period finished
>>>>>>> 10/30/2012 14:57:35| main|compute-1-7|W|job 12730.1 exceeded hard wallclock time - initiate terminate method
>>>>>>> 10/30/2012 14:57:36| main|compute-1-7|I|SIGNAL jid: 12730 jatask: 1 signal: KILL
>>>>>>>
>>>>>>> On 10/30/2012 03:00 PM, Reuti wrote:
>>>>>>>> Sorry, it should look like:
>>>>>>>>
>>>>>>>> 10/30/2012 22:59:50| main|pc15370|W|job 5281.1 exceeded hard wallclock time - initiate terminate method
>>>>>>>>
>>>>>>>> On 30.10.2012 at 22:57, Joseph Farran wrote:
>>>>>>>>
>>>>>>>>> I did not have loglevel set to log_info, so I updated it, restarted
>>>>>>>>> GE on the master, and did a softstop and start on the compute node.
>>>>>>>>>
>>>>>>>>> I got a lot more log information now, but still no cigar:
>>>>>>>>>
>>>>>>>>> # cat /var/spool/ge/compute-12-22/messages | fgrep h_rt
>>>>>>>>> #
>>>>>>>>>
>>>>>>>>> I checked a few other compute nodes as well for "h_rt" and found
>>>>>>>>> nothing either.
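The log lines being grepped for look like the "exceeded hard wallclock time" entries quoted above, and each node has its own messages file. One way to sweep every execd spool directory in one pass is sketched below; the spool path follows the layout used in this thread, and a sample messages file is fabricated first only so the snippet runs stand-alone:

```shell
#!/bin/sh
# Sweep every execd messages file for wallclock kill attempts.
# On a real cluster, spool would be e.g. /var/spool/ge; here a fake
# node directory and log entry are created so the sketch is runnable.
spool=${SPOOL:-./spool}

mkdir -p "$spool/compute-1-7"
cat > "$spool/compute-1-7/messages" <<'EOF'
10/30/2012 14:57:35| main|compute-1-7|W|job 12730.1 exceeded hard wallclock time - initiate terminate method
EOF

# Count kill attempts across all nodes' messages files:
hits=$(grep -h "exceeded hard wallclock" "$spool"/*/messages | wc -l)
echo "wallclock kill attempts found: $hits"
```

A node that never logs such a line (with loglevel at log_info) never even started its terminate method, which narrows the problem to the execd side rather than the kill itself.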
>>>>>>>>>
>>>>>>>>> On 10/30/2012 01:49 PM, Reuti wrote:
>>>>>>>>>> On 30.10.2012 at 20:18, Joseph Farran wrote:
>>>>>>>>>>
>>>>>>>>>>> Here is one case:
>>>>>>>>>>>
>>>>>>>>>>> qstat | egrep "12959|12960"
>>>>>>>>>>> 12959 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12 [email protected] 1
>>>>>>>>>>> 12960 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12 [email protected] 1
>>>>>>>>>>>
>>>>>>>>>>> On compute-12-22:
>>>>>>>>>>>
>>>>>>>>>>> compute-12-22 ~]# ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>>>>>>>>
>>>>>>>>>>>   0 570   0 201 Sl /data/hpc/ge/bin/lx-amd64/sge_execd
>>>>>>>>>>>   0   0   0   0 S   \_ /bin/bash /data/hpc/ge/load-sensor-cores-in-use.sh
>>>>>>>>>>>   0 570   0 201 S   \_ sge_shepherd-12959 -bg
>>>>>>>>>>> 993 993 115 115 Ss  |   \_ -bash /var/spool/ge/compute-12-22/job_scripts/12959
>>>>>>>>>>> 993 993 115 115 Rs  |       \_ ./pcharmm32
>>>>>>>>>>>   0 570   0 201 S   \_ sge_shepherd-12960 -bg
>>>>>>>>>>> 993 993 115 115 Ss      \_ -bash /var/spool/ge/compute-12-22/job_scripts/12960
>>>>>>>>>>> 993 993 115 115 Rs          \_ ./pcharmm32
>>>>>>>>>>>
>>>>>>>>>> Good. Then: do you see any remark about h_rt being exceeded in the
>>>>>>>>>> messages file of the host,
>>>>>>>>>> $SGE_ROOT/default/spool/compute-12-22/messages?
>>>>>>>>>>
>>>>>>>>>> I.e., is
>>>>>>>>>>
>>>>>>>>>> $ qconf -sconf
>>>>>>>>>> ...
>>>>>>>>>> loglevel              log_info
>>>>>>>>>>
>>>>>>>>>> set?
>>>>>>>>>>
>>>>>>>>>> -- Reuti
>>>>>>>>>>
>>>>>>>>>>> On 10/30/2012 12:07 PM, Reuti wrote:
>>>>>>>>>>>> On 30.10.2012 at 20:02, Joseph Farran wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Reuti.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, I already had that set:
>>>>>>>>>>>>>
>>>>>>>>>>>>> qconf -sconf | fgrep execd_params
>>>>>>>>>>>>> execd_params      ENABLE_ADDGRP_KILL=TRUE
>>>>>>>>>>>>>
>>>>>>>>>>>>> What is strange is that about 1 out of 10 jobs does get killed
>>>>>>>>>>>>> just fine when it goes past the hard wall clock time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, the majority of the jobs are not being killed when they
>>>>>>>>>>>>> go past their wall clock time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> How can I investigate this further?
>>>>>>>>>>>> ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500
>>>>>>>>>>>>
>>>>>>>>>>>> (f without -) and please post the relevant lines of the
>>>>>>>>>>>> application.
>>>>>>>>>>>>
>>>>>>>>>>>> -- Reuti
>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/30/2012 11:44 AM, Reuti wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 30.10.2012 at 19:31, Joseph Farran wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I googled this issue but did not find much help on the subject.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have several queues with hard wall clock limits like this one:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # qconf -sq queue | grep h_rt
>>>>>>>>>>>>>>> h_rt                  96:00:00
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am running Son of Grid Engine 8.1.2, and many jobs run past
>>>>>>>>>>>>>>> the hard wall clock limit and continue to run.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking at the GE qmaster logs, I see dozens and dozens of
>>>>>>>>>>>>>>> these entries:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
>>>>>>>>>>>>>> Maybe they jumped out of the process tree (usually jobs are
>>>>>>>>>>>>>> killed by `kill -9 -- -pgrp`). You can kill them by their
>>>>>>>>>>>>>> additional group id, which is attached to all started processes
>>>>>>>>>>>>>> even if they executed something like `setsid`:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> $ qconf -sconf
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>> execd_params      ENABLE_ADDGRP_KILL=TRUE
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If it's still not working, we have to investigate the process
>>>>>>>>>>>>>> tree.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> HTH -- Reuti
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These entries correspond to the running jobs that should have
>>>>>>>>>>>>>>> ended 96 hours ago, but they keep on running.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Why is GE not killing these jobs when they run past the
>>>>>>>>>>>>>>> 96-hour limit, yet complaining that they should have ended?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> users mailing list
>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>> https://gridengine.org/mailman/listinfo/users

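Related to the s_rt/-notify issue mentioned at the top of the thread: a warning signal on its own can never stop a job that traps or ignores it, so if the follow-up SIGKILL is never issued, the job simply runs on while warnings repeat. A self-contained demonstration (no SGE involved; names are illustrative):

```shell
#!/bin/sh
# SIGUSR1 terminates a process by default, but a "job" that ignores it
# survives any number of warnings; only a follow-up SIGKILL ends it.

sh -c 'trap "" USR1; sleep 300' &   # a job that ignores the warning
job=$!
sleep 1                             # give it time to install the trap

kill -USR1 "$job"                   # first warning
kill -USR1 "$job"                   # second warning

survived=0
if kill -0 "$job" 2>/dev/null; then
    survived=1
fi
echo "job survived warnings: $survived"

kill -9 "$job"                      # the kill that must eventually follow
wait "$job" 2>/dev/null
```
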