On 27.09.2012 at 19:41, Vamsi Krishna wrote:

> those were inputs for debugging.
>
> job 1058200.1 failed on host assumedly after job because: job 1058200.1 died
> through signal USR2 (12)
>
> 09/26/2012 17:47:02|worker|E|denied: job "1058200" does not exist
>
> 50 out of 80 batch jobs got killed in a similar way, and one of the jobs
> in the queue was also killed. Does qmaster need a reboot?
>
> On Thu, Sep 27, 2012 at 9:39 PM, Reuti <[email protected]> wrote:
> On 26.09.2012 at 13:48, Vamsi Krishna wrote:
>
>> Exit code 140: The job exceeded the "wall clock" time limit, h_rt is set to
>> infinity
Who stated that exit code 140 means "wall clock" exceeded and nothing else? Did you verify it in the messages file of the shepherd in the node's spooling directory?

-- Reuti

>> submit with -notify by default.
>
> Is this a statement or a question? There can be more reasons for SIGUSR2, like
> an exceeded memory limit as a result of -notify, or it may only be the warning
> sent because someone killed the job with `qdel`.
>
> How can it run into h_rt when it's set to infinity?
>
> -- Reuti
>
>> --PVK
>>
>> On Wed, Sep 26, 2012 at 12:46 PM, Reuti <[email protected]> wrote:
>> On 26.09.2012 at 08:53, Vamsi Krishna wrote:
>>
>> > some of the batch jobs are killed and qacct -j of the job id shows
>> >
>> > failed 100 : assumedly after job
>> > exit_status 140
>>
>> It's 128 + 12 = SIGUSR2. So what can cause this signal to be generated?
>>
>> Something in your job?
>>
>> You submit with -notify?
>>
>> -- Reuti
>>
>> > what could be the reason.
>> >
>> > Regards
>> > PVK
>> >
>> > _______________________________________________
>> > users mailing list
>> > [email protected]
>> > https://gridengine.org/mailman/listinfo/users
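For reference, the "128 + 12" arithmetic quoted above can be checked with a few lines of Python. This is only a sketch of the usual shell convention (an exit status above 128 means "terminated by signal status - 128"); the helper names `decode_exit_status` and `signal_name` are made up for illustration and are not part of Grid Engine. Signal numbering is platform dependent; 12 is SIGUSR2 on common Linux architectures such as x86, so run it on the execution host's platform.

```python
import signal


def decode_exit_status(exit_status):
    """Return the signal number encoded in a shell-style exit status, or None."""
    # By convention, a status above 128 means "killed by signal (status - 128)".
    if exit_status > 128:
        return exit_status - 128
    return None


def signal_name(signum):
    """Look up this platform's name for a signal number, if it has one."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return "unknown"


# exit_status as reported by qacct -j in this thread
signum = decode_exit_status(140)
print(signum, signal_name(signum))
```

This only tells you which signal ended the job, not who sent it; the shepherd's messages file on the node is still the place to look for the cause.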
