On 27.09.2012 at 19:41, Vamsi Krishna wrote:

> those were inputs for debugging. 
> 
> job 1058200.1 failed on host  assumedly after job because: job 1058200.1 died 
> through signal USR2 (12)
> 
> 09/26/2012 17:47:02|worker|E|denied: job "1058200" does not exist 
> 
> 
> 
> 50 out of 80 batch jobs got killed in a similar way, and one of the jobs 
> in the queue was also killed. Does the qmaster need a reboot?
> 
>  
> 
> On Thu, Sep 27, 2012 at 9:39 PM, Reuti <[email protected]> wrote:
> On 26.09.2012 at 13:48, Vamsi Krishna wrote:
> 
>> Exit code 140: The job exceeded the "wall clock" time limit, h_rt is set to 
>> infinity

Who stated that exit code 140 is "wall clock" exceeded and nothing else? Did 
you verify it in the messages file of the shepherd on the node's spooling 
directory?

-- Reuti
 

>> submit with -notify by default.
> 
> Is this a statement or a question? There can be more reasons for SIGUSR2, 
> such as an exceeded memory limit. With -notify, it can also simply be the 
> advance warning sent because someone killed the job with `qdel`.
> 
> How can it run into h_rt when it's set to infinity?
> 
> -- Reuti
> 
> 
> 
>> --PVK
>> 
>> On Wed, Sep 26, 2012 at 12:46 PM, Reuti <[email protected]> wrote:
>> On 26.09.2012 at 08:53, Vamsi Krishna wrote:
>> 
>> > Some of the batch jobs are killed, and qacct -j for the job ID shows:
>> >
>> > failed       100 : assumedly after job
>> > exit_status  140
>> 
>> It's 128 + 12 = SIGUSR2. So what can cause this signal to be generated?
>> 
>> Something in your job?
>> 
>> You submit with -notify?
>> 
>> -- Reuti
>> 
>> 
>> >
>> >
>> > What could be the reason?
>> >
>> > Regards
>> > PVK
>> >
>> > _______________________________________________
>> > users mailing list
>> > [email protected]
>> > https://gridengine.org/mailman/listinfo/users
>> 
>> 
> 
> 
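
The arithmetic quoted above (exit_status 140 = 128 + 12) can be checked with a short sketch. It assumes the common convention that an exit status above 128 means "terminated by signal (status - 128)", and it assumes a Linux node, where signal 12 is SIGUSR2 (signal numbers differ on other platforms). The function name decode_exit_status is made up for this illustration and is not part of Grid Engine.

```python
import signal


def decode_exit_status(status):
    """Return (signal_number, signal_name) if the status follows the
    128 + N convention, otherwise None."""
    if status <= 128:
        # An ordinary exit code, not a signal-derived one.
        return None
    signum = status - 128
    try:
        # Look up the platform's name for this signal number.
        name = signal.Signals(signum).name
    except ValueError:
        # The number has no named signal on this platform.
        name = "unknown"
    return signum, name


if __name__ == "__main__":
    # 140 is the exit_status reported by qacct -j in this thread.
    print(decode_exit_status(140))
```

This only identifies which signal ended the job. It does not say who sent it or why, so the cause (h_rt, a memory limit, -notify combined with `qdel`, or something in the job itself) still has to be confirmed from the shepherd's messages file.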
