[gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Joseph Farran
Hi all. I googled this issue but did not see much help on the subject. I have several queues with hard wall clock limits like this one: # qconf -sq queue | grep h_rt h_rt 96:00:00 I am running Son of Grid Engine 8.1.2 and many jobs run past the hard wall clock limit and

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Reuti
Hi, On 30.10.2012 at 19:31, Joseph Farran wrote: I googled this issue but did not see much help on the subject. I have several queues with hard wall clock limits like this one: # qconf -sq queue | grep h_rt h_rt 96:00:00 I am running Son of Grid Engine 8.1.2 and many

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Joseph Farran
Hi Reuti. Yes, I had that already set: qconf -sconf|fgrep execd_params execd_params ENABLE_ADDGRP_KILL=TRUE What is strange is that 1 out of 10 jobs or so do get killed just fine when they go past the hard wall time clock. However, the majority of the jobs are not being

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Reuti
On 30.10.2012 at 20:02, Joseph Farran wrote: Hi Reuti. Yes, I had that already set: qconf -sconf|fgrep execd_params execd_params ENABLE_ADDGRP_KILL=TRUE What is strange is that 1 out of 10 jobs or so do get killed just fine when they go past the hard wall time clock.

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Joseph Farran
Here is one case: qstat | egrep '12959|12960' 12959 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12 free2@compute-12-22.local 1 12960 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12 free2@compute-12-22.local 1 On compute-12-22: compute-12-22 ~]# ps -e
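The qstat rows above give a start time of 10/24/2012 18:59:12, and the queue limit quoted earlier in the thread is h_rt 96:00:00. As a rough sanity check, the overrun can be computed by hand or with a small sketch like the one below. The function names `parse_h_rt` and `overrun` are hypothetical helpers, not part of Grid Engine, and the "now" timestamp is an assumption (the thread date with an arbitrary time of day). Only the HH:MM:SS form of h_rt is handled.

```python
from datetime import datetime, timedelta


def parse_h_rt(value: str) -> timedelta:
    """Convert an h_rt value of the form HH:MM:SS into a timedelta."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def overrun(start: str, h_rt: str, now: datetime) -> timedelta:
    """Return how far past the limit a job is (negative if still within it).

    `start` uses the MM/DD/YYYY HH:MM:SS layout shown in the qstat output.
    """
    started = datetime.strptime(start, "%m/%d/%Y %H:%M:%S")
    return (now - started) - parse_h_rt(h_rt)


# Assumed "current" time: the day of the thread, noon chosen arbitrarily.
now = datetime(2012, 10, 30, 12, 0, 0)
late_by = overrun("10/24/2012 18:59:12", "96:00:00", now)
print("past limit:", late_by > timedelta(0), "by", late_by)
```

With these assumed inputs the job is well past its 96-hour limit, which matches the symptom described in the thread.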

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Reuti
On 30.10.2012 at 20:18, Joseph Farran wrote: Here is one case: qstat | egrep '12959|12960' 12959 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12 free2@compute-12-22.local 1 12960 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12 free2@compute-12-22.local

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Joseph Farran
Did not have loglevel set to log_info, so I updated it, restarted GE on the master and softstop and start on the compute node. I got a lot more log information now, but still no cigar: # cat /var/spool/ge/compute-12-22/messages | fgrep h_rt # Checked a few other compute nodes as well for the

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Reuti
Sorry, should be like: 10/30/2012 22:59:50| main|pc15370|W|job 5281.1 exceeded hard wallclock time - initiate terminate method On 30.10.2012 at 22:57, Joseph Farran wrote: Did not have loglevel set to log_info, so I updated it, restarted GE on the master and softstop and start on the
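The log line quoted above has five pipe-separated fields (timestamp, thread, host, level, message). A minimal sketch for pulling the wallclock events out of an execd messages file, assuming that layout holds for every line, might look like this. The function name `wallclock_kills` is hypothetical, and the second sample line is the truncated entry that appears later in this thread, used only as a non-matching example.

```python
import re
from typing import Iterable, Iterator, Tuple

# Matches the message text shown in the quoted log line.
WALLCLOCK = re.compile(r"job (\d+)\.(\d+) exceeded hard wallclock time")


def wallclock_kills(lines: Iterable[str]) -> Iterator[Tuple[str, str, int, int]]:
    """Yield (timestamp, host, job_id, task_id) for each wallclock event."""
    for line in lines:
        # Split into at most five fields so any '|' in the message is kept.
        parts = line.rstrip("\n").split("|", 4)
        if len(parts) != 5:
            continue  # not in the assumed five-field layout
        timestamp, _thread, host, _level, message = parts
        match = WALLCLOCK.search(message)
        if match:
            yield timestamp, host.strip(), int(match.group(1)), int(match.group(2))


sample = [
    "10/30/2012 22:59:50|  main|pc15370|W|job 5281.1 exceeded hard wallclock time - initiate terminate method",
    "10/30/2012 14:32:06|  main|compute-1-7|I|registered at",  # truncated in the source
]
hits = list(wallclock_kills(sample))
print(hits)
```

To scan a real node, the same function can be fed an open file handle for that node's messages file, for example the /var/spool/ge/compute-12-22/messages path mentioned in the thread.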

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Joseph Farran
No, still no cigar. # cat /var/spool/ge/compute-12-22/messages | grep wall # Here is what is strange. Some jobs do get killed just fine. One job that just went over the time limit on another queue, GE killed it and here is the log: 10/30/2012 14:32:06| main|compute-1-7|I|registered at

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Joseph Farran
No: # qconf -sq free2 | fgrep terminate terminate_method NONE On 10/30/2012 03:07 PM, Reuti wrote: Mmh, was the terminate method redefined in the queue configuration of the queue in question? On 30.10.2012 at 23:04, Joseph Farran wrote: No, still no cigar. # cat

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Reuti
On 30.10.2012 at 23:45, Joseph Farran wrote: No: # qconf -sq free2 | fgrep terminate terminate_method NONE Is the process still doing something serious or hanging somewhere in a loop: $ strace -p 1234 and 1234 is the pid of the process on the node (you have to be root or owner of

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Joseph Farran
The strace shows job running ok: doing work and then writing to a file. I was able to kill the jobs ( 1-core each ) just fine with kill -9. Looking at the qmaster log after a few minutes said: 10/30/2012 15:58:41|worker|hpc|I|removing trigger to terminate job 12960.1 10/30/2012

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Reuti
On 31.10.2012 at 00:03, Joseph Farran wrote: The strace shows job running ok: doing work and then writing to a file. I was able to kill the jobs ( 1-core each ) just fine with kill -9. Looking at the qmaster log after a few minutes said: 10/30/2012 15:58:41|worker|hpc|I|removing

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Reuti
On 31.10.2012 at 00:13, Joseph Farran wrote: At first, I only had the hard wall clock h_rt, but a while ago I also added the soft one: Here are all of the related fields: # qconf -sq free2 | egrep 'rt|notify|terminate' shell_start_mode posix_compliant starter_method NONE

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Reuti
On 31.10.2012 at 00:30, Joseph Farran wrote: Looking at one of the other running jobs (that should have ended by now), I don't see the notify: # cat /var/spool/ge/qmaster/job_scripts/12923 | fgrep notify # qstat| grep 12923 12923 0.50500 dna.pmf_15 amentes r 10/24/2012