AFAIR, when the vmem limit is passed, the abort message says KILL, not XCPU. Anyway, 433M is below the limit (soft 450, hard 480), so I don't think memory is involved here.
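For what it's worth, here is a quick sanity check of the reported numbers against the limits, as a minimal Python sketch. The helper name hms_to_seconds and the variable names are mine, not anything from Grid Engine, and I am assuming the vmem limits are in megabytes like the reported Max vmem.

```python
def hms_to_seconds(text):
    """Convert an 'HH:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Values copied from the abort e-mail and from "qconf -sq <queue>".
cpu_used = hms_to_seconds("01:40:35")
s_cpu = hms_to_seconds("168:00:00")
h_cpu = hms_to_seconds("169:00:00")

# Assumption: the vmem limits (soft 450, hard 480) are in megabytes.
max_vmem_mb = 433.707
s_vmem_mb = 450
h_vmem_mb = 480

cpu_within_limit = cpu_used < s_cpu
vmem_within_limit = max_vmem_mb < s_vmem_mb

print("CPU used (s):", cpu_used, "soft CPU limit (s):", s_cpu)
print("CPU below soft limit:", cpu_within_limit)
print("vmem below soft limit:", vmem_within_limit)
```

Both checks come out true for these values, so neither the queue's CPU limit nor the vmem limit, as quoted, accounts for the XCPU signal.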

J

2012/10/19 Reuti <[email protected]>:
> On 19.10.2012 at 19:01, Jérémie Dubois-Lacoste wrote:
>
>> One user on our cluster is having this problem, which I've never
>> seen before. According to him there is some randomness: the
>> same job may succeed or fail from time to time.
>> When the job aborts he gets this e-mail:
>>
>> Start Time = 10/19/2012 15:25:17
>> End Time = 10/19/2012 17:07:20
>> CPU = 01:40:35
>> Max vmem = 433.707M
>
> It's also sent if s_vmem is passed.
>
> -- Reuti
>
>
>> failed assumedly after job because:
>> job 5433573.1 died through signal XCPU (24)
>>
>> So the job was running for 1h40, then got killed.
>>
>> But the queue that he submitted to has a CPU time limit
>> of one week. Among the output of "qconf -sq <queue>":
>> s_cpu 168:00:00
>> h_cpu 169:00:00
>>
>> Any idea?
>>
>> Jérémie
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
