BTW: If you set:
$ qconf -sconf
...
loglevel                     log_info
...

you will get an entry in the messages file of the node when SGE discovers such a condition.

-- Reuti


On 19.10.2012 at 20:47, Reuti wrote:

> On 19.10.2012 at 19:43, Jérémie Dubois-Lacoste wrote:
>
>> afair, when vmem is passed, the abort message says KILL,
>> not XCPU. But anyway 433M is below the limit (soft 450,
>> hard 480), so I don't think the memory is involved here.
>
> Defined by M or m?
>
> M = base 1024
> m = base 1000
>
> -- Reuti
>
> (man sge_types)
>
>
>> 2012/10/19 Reuti <[email protected]>:
>>> On 19.10.2012 at 19:01, Jérémie Dubois-Lacoste wrote:
>>>
>>>> One user on our cluster is having this problem, which I've never
>>>> seen before. According to him there is some randomness: the
>>>> same job may succeed or fail from time to time.
>>>> When the job aborts he gets this e-mail:
>>>>
>>>> Start Time = 10/19/2012 15:25:17
>>>> End Time = 10/19/2012 17:07:20
>>>> CPU = 01:40:35
>>>> Max vmem = 433.707M
>>>
>>> It's also sent if s_vmem is passed.
>>>
>>> -- Reuti
>>>
>>>
>>>> failed assumedly after job because:
>>>> job 5433573.1 died through signal XCPU (24)
>>>>
>>>> So the job was running for 1h40, then got killed.
>>>>
>>>> But the queue that he submitted to has a CPU time limit
>>>> of one week. Among the output of "qconf -sq <queue>":
>>>> s_cpu 168:00:00
>>>> h_cpu 169:00:00
>>>>
>>>> Any idea?
>>>>
>>>> Jérémie
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> [email protected]
>>>> https://gridengine.org/mailman/listinfo/users
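
To illustrate why the M/m question in the thread matters, here is a minimal Python sketch of the arithmetic. It assumes the multipliers described in man sge_types (upper case = base 1024, lower case = base 1000) and assumes that the "Max vmem = 433.707M" figure in the job mail is reported in base 1024. The helper name to_bytes is made up for this example and is not part of SGE.

```python
# Multipliers as described in `man sge_types`:
# lower case = base 1000, upper case = base 1024.
MULTIPLIERS = {
    "k": 1000, "K": 1024,
    "m": 1000**2, "M": 1024**2,
    "g": 1000**3, "G": 1024**3,
}


def to_bytes(spec: str) -> float:
    """Convert a memory specifier such as '450M' or '450m' to bytes.

    Hypothetical helper for illustration only; not an SGE API.
    """
    if spec and spec[-1] in MULTIPLIERS:
        return float(spec[:-1]) * MULTIPLIERS[spec[-1]]
    return float(spec)  # no suffix: plain bytes


# Assumption: the mail reports the peak in base 1024 ("M").
peak = to_bytes("433.707M")

# Compare the peak against the soft limit written both ways.
for limit in ("450M", "450m"):
    status = "exceeded" if peak > to_bytes(limit) else "not exceeded"
    print(f"s_vmem={limit}: {status}")
```

Under these assumptions, a soft limit written as 450M (471,859,200 bytes) would not be reached, while one written as 450m (450,000,000 bytes) would be exceeded by a peak of 433.707M, which would fit the XCPU signal described in the thread.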
