Hi,

On 30.10.2012 at 19:31, Joseph Farran wrote:

> I googled this issue but did not see much help on the subject.
>
> I have several queues with hard wall clock limits like this one:
>
> # qconf -sq queue | grep h_rt
> h_rt                  96:00:00
>
> I am running Son of Grid Engine 8.1.2 and many jobs run past the hard wall
> clock limit and continue to run.
>
> Looking at the GE qmaster logs, I see dozens and dozens of these entries:
>
> 10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s

Maybe they jumped out of the process tree (usually jobs are killed by
`kill -9 -- -pgrp`). You can kill them by their additional group id, which is
attached to all started processes even if they executed something like `setsid`:

$ qconf -sconf
...
execd_params                 ENABLE_ADDGRP_KILL=TRUE

If it's still not working, we have to investigate the process tree.

HTH - Reuti

> These entries correspond to the running jobs that should have ended 96 hours
> ago, but they keep on running.
>
> Why is GE not killing these jobs correctly when they run past the 96 hour
> limit but yet complains they should have ended?

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
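
To illustrate the `setsid` point above: the following is a minimal Python sketch, not part of Grid Engine, showing that a child which starts its own session ends up in a different process group than its parent. A signal sent to the parent's process group (as with `kill -9 -- -pgrp`) would therefore not reach it. The function name `child_pgid_after_setsid` is made up for this example, and the sketch assumes a POSIX system.

```python
import os
import subprocess
import sys


def child_pgid_after_setsid():
    """Start a child in a new session and return (our_pgid, child_pgid)."""
    # start_new_session=True makes the child call setsid() before exec,
    # which is what a job script does when it runs something via `setsid`.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os; print(os.getpgid(0))"],
        stdout=subprocess.PIPE,
        text=True,
        start_new_session=True,
    )
    out, _ = child.communicate()
    return os.getpgid(0), int(out.strip())


if __name__ == "__main__":
    ours, theirs = child_pgid_after_setsid()
    print("our process group:  ", ours)
    print("child process group:", theirs)
    # A kill aimed at our process group would miss the child.
    print("child escaped our group:", ours != theirs)
```

The additional group id that the execution daemon attaches to a job's processes is not changed by `setsid`, which is why killing by that id can still find such processes.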
