On 28.09.2012 at 05:14, Vamsi Krishna wrote:

> Regarding exit status 140: I read that somewhere on the internet, so excuse
> me if it is wrong. May I get more details about this exit status and why the
> job was killed with signal 12? Actually, nothing is in
> /default/spool/`hostname`/messages. I found the messages only in
> qmaster/messages.
> 
> I found only one message in default/spool/`hostname`/messages:

Maybe the log level needs to be adjusted:

$ qconf -sconf
...
loglevel                     log_info
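
On the exit status itself: as noted earlier in this thread, 140 is 128 + 12, i.e. the job ended because of signal 12 (SIGUSR2 on this lx24-amd64 platform). A minimal sketch of that arithmetic in Python follows; the helper names `decode_exit_status` and `signal_name` are made up for illustration and are not part of SGE, and the signal name is looked up on whatever machine runs the script, so it only matches if that machine numbers signals the same way as the execution host.

```python
import signal


def decode_exit_status(status):
    """Return the signal number if status follows the 128 + N convention, else None."""
    if status > 128:
        return status - 128
    return None


def signal_name(signum):
    """Look up the local platform's name for a signal number, or None if unknown."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return None


if __name__ == "__main__":
    status = 140  # exit_status reported by qacct -j in this thread
    signum = decode_exit_status(status)
    if signum is None:
        print(f"exit status {status}: ordinary exit code, no signal")
    else:
        print(f"exit status {status}: signal {signum} ({signal_name(signum)})")
```

This only says which signal ended the job, not who sent it or why; that still has to come from the messages files.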

-- Reuti


> starting up SGE 6.2u5 (lx24-amd64)
> 
> Regards
> PVK
> 
> On Thu, Sep 27, 2012 at 11:50 PM, Reuti <[email protected]> wrote:
> On 27.09.2012 at 19:41, Vamsi Krishna wrote:
> 
>> Those were inputs for debugging.
>> 
>> job 1058200.1 failed on host  assumedly after job because: job 1058200.1 
>> died through signal USR2 (12)
>> 
>> 09/26/2012 17:47:02|worker|E|denied: job "1058200" does not exist 
>> 
>> 
>> 
>> 50 out of 80 batch jobs got killed in a similar way, and one of the jobs
>> in the queue was also killed. Does the qmaster need a reboot?
>> 
>>  
>> 
>> On Thu, Sep 27, 2012 at 9:39 PM, Reuti <[email protected]> wrote:
>> On 26.09.2012 at 13:48, Vamsi Krishna wrote:
>> 
>>> Exit code 140: The job exceeded the "wall clock" time limit, h_rt is set to 
>>> infinity
> 
> Who stated that exit code 140 is "wall clock" exceeded and nothing else? Did 
> you verify it in the messages file of the shepherd on the node's spooling 
> directory?
> 
> -- Reuti
>  
> 
>>> submit with -notify by default.
>> 
>> Is this a statement or a question? With -notify there can be several
>> reasons for SIGUSR2, such as an exceeded memory limit, or it may simply be
>> the warning sent because someone killed the job with `qdel`.
>> 
>> How can it run into h_rt when it's set to infinity?
>> 
>> -- Reuti
>> 
>> 
>> 
>>> --PVK
>>> 
>>> On Wed, Sep 26, 2012 at 12:46 PM, Reuti <[email protected]> wrote:
>>> On 26.09.2012 at 08:53, Vamsi Krishna wrote:
>>> 
>>> > Some of the batch jobs are killed, and qacct -j for the job ID shows:
>>> >
>>> > failed       100 : assumedly after job
>>> > exit_status  140
>>> 
>>> It's 128 + 12 = SIGUSR2. So what can cause this signal to be generated?
>>> 
>>> Something in your job?
>>> 
>>> You submit with -notify?
>>> 
>>> -- Reuti
>>> 
>>> 
>>> >
>>> >
>>> > What could be the reason?
>>> >
>>> > Regards
>>> > PVK
>>> >
>>> > _______________________________________________
>>> > users mailing list
>>> > [email protected]
>>> > https://gridengine.org/mailman/listinfo/users
>>> 
>>> 
>> 
>> 
> 
> 

