Looking at one of the other running jobs (one that should have ended by now), I
don't see the notify:

# cat /var/spool/ge/qmaster/job_scripts/12923 | fgrep notify

# qstat| grep 12923
  12923 0.50500 dna.pmf_15 amentes      r     10/24/2012 18:59:08 [email protected]          1
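(Side note: grepping the spooled job script only shows the script body; whether -notify was requested is a submit option, so it should show up in the job's stored submit parameters instead. A sketch, if memory serves, since qstat -j lists a "notify:" line per job:)

```shell
# -notify is a submit option, not part of the script body, so check
# the job's stored submit parameters; qstat -j should print a
# "notify:" line (TRUE/FALSE) for the job.
qstat -j 12923 | grep -i notify
```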



On 10/30/2012 04:18 PM, Reuti wrote:
Am 31.10.2012 um 00:13 schrieb Joseph Farran:

At first, I only had the hard wall clock "h_rt", but a while ago I also added 
the soft one:

Here are all of the related fields:

# qconf -sq free2 | egrep "rt|notify|terminate"
shell_start_mode      posix_compliant
starter_method        NONE
terminate_method      NONE
notify                00:00:60
s_rt                  96:00:00
h_rt                  96:00:00

Notify is set to 60 seconds, but I don't know what it does.
Were they also submitted with -notify? There was (is) an issue when both the
s_rt warning and -notify are requested: the warning is sent to the job every 90
seconds, but the job never gets killed.
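For jobs that do get the warning, the job script itself has to act on it, since it arrives as a plain signal. A minimal sketch (generic bash, nothing SGE-specific; the assumption here is the usual SGE behavior of SIGUSR1 when s_rt is exceeded and SIGUSR2 shortly before the final SIGKILL when -notify is used):

```shell
#!/bin/bash
# Sketch of a job script that reacts to the wallclock warning instead
# of ignoring it: trap the warning signal, checkpoint, and exit before
# h_rt's hard SIGKILL arrives.
cleanup() {
    echo "wallclock warning received - checkpointing and exiting"
    exit 0
}
trap cleanup USR1 USR2

# placeholder for the real work
while true; do
    sleep 1 &
    wait $!    # wait is interruptible, so the trap fires promptly
done
```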

-- Reuti


On 10/30/2012 04:06 PM, Reuti wrote:
Am 31.10.2012 um 00:03 schrieb Joseph Farran:

The strace shows the job running OK: doing work and then writing to a file.

I was able to kill the jobs (1 core each) just fine with "kill -9".

Looking at the qmaster log a few minutes later, it said:

10/30/2012 15:58:41|worker|hpc|I|removing trigger to terminate job 12960.1
10/30/2012 15:58:41|worker|hpc|I|job 12960.1 finished on host compute-12-22.local
10/30/2012 15:58:41|worker|hpc|I|removing trigger to terminate job 12959.1
10/30/2012 15:58:41|worker|hpc|I|job 12959.1 finished on host compute-12-22.local
Did you define s_rt and -notify too?

-- Reuti



So GE cleared out the jobs OK.   Not sure why sge_execd on the node is not
killing them correctly.

Oh well, thanks Reuti.   I will keep playing with this...



On 10/30/2012 03:53 PM, Reuti wrote:
Am 30.10.2012 um 23:45 schrieb Joseph Farran:

No:

# qconf -sq free2 | fgrep terminate
terminate_method      NONE
Is the process still doing something serious, or is it hanging somewhere in a loop:

$ strace -p 1234

where 1234 is the PID of the process on the node (you have to be root or the
owner of the process).

Afterwards: is a kill -9 1234 by hand successful?

-- Reuti


On 10/30/2012 03:07 PM, Reuti wrote:
Mmh, was the terminate method redefined in the configuration of the queue in
question?


Am 30.10.2012 um 23:04 schrieb Joseph Farran:

No, still no cigar.

# cat  /var/spool/ge/compute-12-22/messages | grep wall
#

Here is what is strange.

Some jobs do get killed just fine.   One job on another queue just went over
the time limit, GE killed it, and here is the log:

10/30/2012 14:32:06|  main|compute-1-7|I|registered at qmaster host "hpc.local"
10/30/2012 14:32:06|  main|compute-1-7|I|Reconnected to qmaster - enabled delayed job reporting period
10/30/2012 14:42:04|  main|compute-1-7|I|Delayed job reporting period finished
10/30/2012 14:57:35|  main|compute-1-7|W|job 12730.1 exceeded hard wallclock time - initiate terminate method
10/30/2012 14:57:36|  main|compute-1-7|I|SIGNAL jid: 12730 jatask: 1 signal: KILL


On 10/30/2012 03:00 PM, Reuti wrote:
Sorry, should be like:

10/30/2012 22:59:50|  main|pc15370|W|job 5281.1 exceeded hard wallclock time - initiate terminate method


Am 30.10.2012 um 22:57 schrieb Joseph Farran:

I did not have loglevel set to log_info, so I updated it, restarted GE on the
master, and did a softstop and start on the compute node.

I got a lot more log information now, but still no cigar:

# cat /var/spool/ge/compute-12-22/messages | fgrep h_rt
#

Checked a few other compute nodes as well for the "h_rt" and nothing either.



On 10/30/2012 01:49 PM, Reuti wrote:
Am 30.10.2012 um 20:18 schrieb Joseph Farran:

Here is one case:

qstat| egrep "12959|12960"
  12959 0.50500 dna.pmf_17 amentes      r     10/24/2012 18:59:12 [email protected]          1
  12960 0.50500 dna.pmf_17 amentes      r     10/24/2012 18:59:12 [email protected]          1

On compute-12-22:

compute-12-22 ~]# ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500

    0   570     0   201 Sl   /data/hpc/ge/bin/lx-amd64/sge_execd
    0     0     0     0 S     \_ /bin/bash /data/hpc/ge/load-sensor-cores-in-use.sh
    0   570     0   201 S     \_ sge_shepherd-12959 -bg
  993   993   115   115 Ss    |   \_ -bash /var/spool/ge/compute-12-22/job_scripts/12959
  993   993   115   115 Rs    |       \_ ./pcharmm32
    0   570     0   201 S     \_ sge_shepherd-12960 -bg
  993   993   115   115 Ss        \_ -bash /var/spool/ge/compute-12-22/job_scripts/12960
  993   993   115   115 Rs            \_ ./pcharmm32

Good, then: do you see any remark about h_rt being exceeded in the messages
file of the host, $SGE_ROOT/default/spool/compute-12-22/messages?

I.e.:

$ qconf -sconf
...
loglevel                     log_info

is set?

-- Reuti


On 10/30/2012 12:07 PM, Reuti wrote:
Am 30.10.2012 um 20:02 schrieb Joseph Farran:

Hi Reuti.

Yes, I had that already set:

qconf -sconf|fgrep execd_params
execd_params                 ENABLE_ADDGRP_KILL=TRUE

What is strange is that about 1 in 10 jobs does get killed just fine when it
goes past the hard wallclock limit.

However, the majority of the jobs are not killed when they go past their
wallclock limit.

How can I investigate this further?
ps -e f -o ruid,euid,rgid,egid,stat,command --cols=500

(f without a leading -) and please post the relevant lines for the application.

-- Reuti


On 10/30/2012 11:44 AM, Reuti wrote:
Hi,

Am 30.10.2012 um 19:31 schrieb Joseph Farran:

I googled this issue but did not find much help on the subject.

I have several queues with hard wall clock limits like this one:

# qconf -sq queue  | grep h_rt
h_rt                  96:00:00

I am running Son of Grid Engine 8.1.2, and many jobs run past the hard wall
clock limit and continue to run.

Looking at the GE qmaster logs, I see dozens and dozens of these entries:

    10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
Maybe they jumped out of the process tree (usually jobs are killed by a `kill -9
-- -pgrp`). You can still kill them by their additional group id, which is
attached to all started processes even if they executed something like `setsid`:

$ qconf -sconf
...
execd_params                 ENABLE_ADDGRP_KILL=TRUE

If it's still not working, we have to investigate the process tree.
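The "jumped out of the process tree" case is easy to reproduce outside of SGE. A generic Linux sketch (no SGE involved; `setsid` here stands in for whatever the application does to daemonize, and it assumes a non-interactive shell so the backgrounded child is not already a group leader):

```shell
#!/bin/bash
# A child started through setsid gets its own session and process
# group, so a group-wide kill from the parent no longer reaches it.
sleep 30 &              # stays in this script's process group
plain=$!
setsid sleep 30 &       # moves itself into a new process group
escaped=$!
sleep 0.3               # let setsid complete

# The two children now report different PGIDs:
ps -o pid=,pgid= -p "$plain" "$escaped"

# A group kill (what the shepherd does by default) would only hit the
# first one:  kill -9 -- -$$
kill "$plain" "$escaped" 2>/dev/null   # clean up the demo
```

The additional group id that ENABLE_ADDGRP_KILL relies on survives this trick, which is why it still finds the escaped process.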

HTH - Reuti


These entries correspond to the running jobs that should have ended 96 hours 
ago, but they keep on running.

Why is GE not killing these jobs when they run past the 96-hour limit, yet
complains that they should have ended?






_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

