Re: [gridengine users] Jobs died through signal XCPU while not exceeding limit

Reuti Fri, 19 Oct 2012 11:48:37 -0700

On 19.10.2012 at 19:43, Jérémie Dubois-Lacoste wrote:

> afair, when vmem is passed, the abort message says KILL,
> not XCPU. But anyway 433M is below the limit (soft 450,
> hard 480), so I don't think the memory is involved here.


Was the limit defined with M or m?

M = base 1024
m = base 1000

-- Reuti

(man sge_types)
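
To illustrate why the suffix matters here, a minimal Python sketch. It assumes the limits were entered as 450m/480m (lowercase) and that the reported "Max vmem" of 433.707M is in base 1024; neither is confirmed in this thread. The helper name to_bytes is made up for the example.

```python
# Memory multipliers as described in sge_types: lowercase suffixes are
# powers of 1000, uppercase suffixes are powers of 1024.
MULTIPLIERS = {
    "k": 1000, "K": 1024,
    "m": 1000**2, "M": 1024**2,
    "g": 1000**3, "G": 1024**3,
}


def to_bytes(spec: str) -> float:
    """Convert a size such as '450m' or '433.707M' to bytes (hypothetical helper)."""
    suffix = spec[-1]
    if suffix in MULTIPLIERS:
        return float(spec[:-1]) * MULTIPLIERS[suffix]
    return float(spec)  # no suffix: plain bytes


# Assumption: the e-mail reports Max vmem in base 1024.
max_vmem = to_bytes("433.707M")

# Compare against both possible spellings of the soft and hard limits.
for name, spec in [("soft", "450m"), ("soft", "450M"),
                   ("hard", "480m"), ("hard", "480M")]:
    limit = to_bytes(spec)
    state = "exceeded" if max_vmem > limit else "not exceeded"
    print(f"{name} limit {spec}: {limit:.0f} bytes, {state}")
```

Under these assumptions 433.707M is about 454.8 million bytes, which is above 450m (450,000,000 bytes) but below 450M, 480m and 480M. Only the soft limit would be crossed, and only if it was written with a lowercase m.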


> 2012/10/19 Reuti <[email protected]>:
>> On 19.10.2012 at 19:01, Jérémie Dubois-Lacoste wrote:
>> 
>>> One user on our cluster is having a problem that I've never
>>> seen before. According to him there is some randomness: the
>>> same job may succeed or fail from time to time.
>>> When the job aborts, he gets this e-mail:
>>> 
>>> Start Time       = 10/19/2012 15:25:17
>>> End Time         = 10/19/2012 17:07:20
>>> CPU              = 01:40:35
>>> Max vmem         = 433.707M
>> 
>> It's also sent if s_vmem is exceeded.
>> 
>> -- Reuti
>> 
>> 
>>> failed assumedly after job because:
>>> job 5433573.1 died through signal XCPU (24)
>>> 
>>> So the job ran for 1h40, then got killed.
>>> 
>>> But the queue that he submitted to has a CPU time limit
>>> of one week. Among the output of "qconf -sq <queue>":
>>> s_cpu                 168:00:00
>>> h_cpu                 169:00:00
>>> 
>>> Any idea?
>>> 
>>> Jérémie
>>> 
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>>> 
>> 
> 


