On 11.02.2015 at 20:13, Michael Stauffer <[email protected]> wrote:

> On Wed, Feb 11, 2015 at 2:02 PM, Reuti <[email protected]> wrote:
> Hi,
>
> > On 11.02.2015 at 19:28, Michael Stauffer <[email protected]> wrote:
> >
> > Hi,
> >
> > Is there a way to easily query whether a job is idle or otherwise stuck
> > even though the queue state says it's running? I've seen some old jobs
> > that are listed as running in the queue, but upon investigation on their
> > compute node there is no CPU activity associated with the processes, and
> > there are no error messages in the output files.
>
> You can check the used CPU time by looking at the "usage" line in the
> `qstat -j <job_id>` output.
>
> Any logic that gives a safe indication of whether a job is stuck in an
> infinite loop or still computing won't be easy to implement and will most
> likely depend on each particular application, and on whether there are any
> output or scratch files which can be checked too. But even then the same
> output may be written to them repeatedly.
>
> We even have jobs which (apparently) compute fine, where only manual
> investigation can tell that the computed values converge to a wrong state
> or oscillate between states and will never stop.
>
> -- Reuti
>
> Thanks Reuti. I can see how this would be difficult. I may use the 'usage'
> line from qstat. I could check every N hours, writing the usage output for
> each running job to a file, then check the current usage stats against the
> previous run's file and look for lines that haven't changed at all. To be
> safe I'd then just email the user to suggest they take a look.
>
> This won't catch instances of jobs that are stuck in loops of course, but
> at least it'll catch completely hung jobs.
>
> How often are a job's stats updated? Looks like every 40 seconds?

As defined in "load_report_time" IIRC.

-- Reuti

> -M
>
> > I can devise a script to do this, but if there's already something for
> > this I'd just use that. Thanks.
> >
> > -M

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
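
The periodic check Michael describes in the thread (record each running job's usage line, then compare it with the previous run) could be sketched in Python roughly as follows. This is an illustration only, not an existing Grid Engine tool: the helper names (`parse_usage`, `find_unchanged`, `snapshot`, `check`), the JSON state file, and the assumed layout of the "usage" line are all assumptions, and the list of running job ids has to be supplied by the caller.

```python
import json
import re
import subprocess
from pathlib import Path

# ASSUMPTION: a usage line in `qstat -j <job_id>` output looks roughly like
#   usage    1:    cpu=00:10:42, mem=3.2 GBs, io=0.1, vmem=1.1G, maxvmem=1.2G
# where the number before the colon is the task id. Adjust if yours differs.
USAGE_RE = re.compile(r"^usage[ \t]+(\d+):[ \t]+(.*\S)[ \t]*$", re.MULTILINE)


def parse_usage(qstat_j_output):
    """Map task id -> usage string for one job's `qstat -j` output."""
    return {task: usage for task, usage in USAGE_RE.findall(qstat_j_output)}


def find_unchanged(previous, current):
    """Return the keys whose usage string is identical in both snapshots."""
    return sorted(k for k, v in current.items() if previous.get(k) == v)


def snapshot(job_ids):
    """Collect usage strings for the given job ids by calling `qstat -j`."""
    snap = {}
    for job_id in job_ids:
        result = subprocess.run(
            ["qstat", "-j", str(job_id)],
            capture_output=True, text=True, check=False,
        )
        for task, usage in parse_usage(result.stdout).items():
            snap[f"{job_id}.{task}"] = usage
    return snap


def check(job_ids, state_file):
    """Compare against the previous run's state file, then overwrite it."""
    path = Path(state_file)
    previous = json.loads(path.read_text()) if path.exists() else {}
    current = snapshot(job_ids)
    path.write_text(json.dumps(current))
    return find_unchanged(previous, current)
```

Run every N hours (for example from cron) with the ids of the running jobs, `check(job_ids, "usage_state.json")` returns the job.task keys whose usage line is identical to the previous run; those are the candidates for a mail to the job owner. The interval should be well above load_report_time, since per Reuti's reply that setting governs how often the figures are updated. As noted in the thread, this only catches completely hung jobs, not ones spinning in a loop.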
