Hi,

On 30.10.2012 at 19:31, Joseph Farran wrote:

> I googled this issue but did not find much help on the subject.
> 
> I have several queues with hard wall clock limits like this one:
> 
> # qconf -sq queue  | grep h_rt
> h_rt                  96:00:00
> 
> I am running Son of Grid Engine 8.1.2, and many jobs run past the hard wall 
> clock limit and continue to run.
> 
> Looking at the GE qmaster logs, I see dozens and dozens of these entries:
> 
>    10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s

Maybe they jumped out of the process tree (usually jobs are killed by
`kill -9 -- -pgrp`). You can kill them by their additional group ID instead,
which is attached to all started processes even if they executed something
like `setsid`:

$ qconf -sconf
...
execd_params                 ENABLE_ADDGRP_KILL=TRUE

If it's still not working, we have to investigate the process tree.
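
To see why a kill aimed at the process group can miss such a process, here is a small sketch. It is not SGE-specific and assumes a Linux host with `setsid` from util-linux and `ps` from procps; the helper name `pgid_of` is made up for this illustration. It starts one ordinary child and one via `setsid`, then compares their process group IDs:

```shell
#!/bin/sh
# Sketch only: shows that a child started via setsid leaves the parent's
# process group. Assumes Linux with util-linux `setsid` and procps `ps`.

# pgid_of PID: print the process group ID of PID without padding
# (hypothetical helper, not part of any SGE tooling).
pgid_of() {
    ps -o pgid= -p "$1" | tr -d ' '
}

sleep 30 &              # ordinary child: stays in this script's process group
normal_pid=$!
setsid sleep 30 &       # this child moves into a new session and process group
escaped_pid=$!
sleep 1                 # give setsid time to run before inspecting

own_pgid=$(pgid_of $$)
normal_pgid=$(pgid_of "$normal_pid")
escaped_pgid=$(pgid_of "$escaped_pid")

echo "script pgid:        $own_pgid"
echo "normal child pgid:  $normal_pgid"
echo "setsid child pgid:  $escaped_pgid"

# A `kill -9 -- -$own_pgid` would reach only the first child.
# Clean up both children explicitly.
kill "$normal_pid" "$escaped_pid" 2>/dev/null
```

The second child reports a different process group, so a signal sent to the original group never reaches it. Supplementary group IDs, on the other hand, are inherited and are not changed by `setsid`, which is why the additional group ID still identifies the job's processes. On Linux you can check them for a suspect process with `grep Groups /proc/<pid>/status`.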

HTH - Reuti


> 
> These entries correspond to the running jobs that should have ended 96 hours 
> ago, but they keep on running.
> 
> Why is GE not killing these jobs when they run past the 96-hour limit, 
> even though it complains that they should have ended?
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users


