Re: [gridengine users] Jobs died through signal XCPU while not exceeding limit

Reuti Fri, 19 Oct 2012 12:52:42 -0700

BTW:

If you set:


$ qconf -sconf
...
loglevel                     log_info
...

you will get an entry in the messages file of the node when SGE discovers such 
a condition.

-- Reuti


On 19.10.2012 at 20:47, Reuti wrote:

> On 19.10.2012 at 19:43, Jérémie Dubois-Lacoste wrote:
> 
>> AFAIR, when the vmem limit is exceeded, the abort message says KILL,
>> not XCPU. But anyway, 433M is below the limit (soft 450,
>> hard 480), so I don't think memory is involved here.
> 
> Defined by M or m?
> 
> M = base 1024 (1024 * 1024 bytes)
> m = base 1000 (1000 * 1000 bytes)
> 
> -- Reuti
> 
> (man sge_types)
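
To illustrate why the suffix matters, here is a rough sketch of the arithmetic. It assumes the reported "Max vmem = 433.707M" uses the binary M and that the limits were entered as 450m and 480m with a lowercase m; neither assumption is confirmed in this thread. The helper `to_bytes` is hypothetical and not part of SGE.

```python
# Multipliers as described in sge_types(1): lowercase suffixes are
# decimal (base 1000), uppercase suffixes are binary (base 1024).
MULTIPLIERS = {
    "k": 1000, "K": 1024,
    "m": 1000 ** 2, "M": 1024 ** 2,
    "g": 1000 ** 3, "G": 1024 ** 3,
}


def to_bytes(spec: str) -> int:
    """Convert a memory value such as '450M' or '450m' to bytes.

    Hypothetical helper for illustration only.
    """
    suffix = spec[-1]
    if suffix in MULTIPLIERS:
        return int(float(spec[:-1]) * MULTIPLIERS[suffix])
    return int(float(spec))  # no suffix: plain bytes


# Assumption: the usage in the job e-mail is reported with binary M.
used = to_bytes("433.707M")

# Compare against both possible spellings of the soft and hard limits.
for name, value in (("s_vmem", "450"), ("h_vmem", "480")):
    for suffix in ("m", "M"):
        limit = to_bytes(value + suffix)
        status = "EXCEEDED" if used > limit else "ok"
        print(f"{name}={value}{suffix}: limit {limit} bytes, used {used} bytes -> {status}")
```

Under these assumptions the usage would be above a soft limit of 450m but below 450M, and below the hard limit in either spelling, which would be consistent with a soft-limit signal rather than a KILL.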
> 
> 
>> 2012/10/19 Reuti <[email protected]>:
>>> On 19.10.2012 at 19:01, Jérémie Dubois-Lacoste wrote:
>>> 
>>>> One user on our cluster is having this problem, that I've never
>>>> seen before. According to him there is some randomness, the
>>>> same job may succeed or fail from time to time.
>>>> When the job aborts, he gets this e-mail:
>>>> 
>>>> Start Time       = 10/19/2012 15:25:17
>>>> End Time         = 10/19/2012 17:07:20
>>>> CPU              = 01:40:35
>>>> Max vmem         = 433.707M
>>> 
>>> It's also sent if s_vmem is exceeded.
>>> 
>>> -- Reuti
>>> 
>>> 
>>>> failed assumedly after job because:
>>>> job 5433573.1 died through signal XCPU (24)
>>>> 
>>>> So the job had been running for 1h40 when it got killed.
>>>> 
>>>> But the queue that he submitted to has a CPU time limit
>>>> of one week. Among the output of "qconf -sq <queue>":
>>>> s_cpu                 168:00:00
>>>> h_cpu                 169:00:00
>>>> 
>>>> Any idea?
>>>> 
>>>> Jérémie
>>>> 
>>>> _______________________________________________
>>>> users mailing list
>>>> [email protected]
>>>> https://gridengine.org/mailman/listinfo/users
>>>> 
>>> 
>> 
> 
> 

