On 13.03.2012 at 15:08, Lars van der bijl wrote:

> On 13 March 2012 13:55, Reuti <[email protected]> wrote:
>> On 13.03.2012 at 12:46, Lars van der bijl wrote:
>>
>>> On 13 March 2012 12:32, Reuti <[email protected]> wrote:
>>>> On 13.03.2012 at 12:03, Lars van der bijl wrote:
>>>>
>>>>> On 13 March 2012 11:18, Reuti <[email protected]> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 13.03.2012 at 10:59, Lars van der bijl wrote:
>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>> We're having the following problem.
>>>>>>>
>>>>>>> randomly on some tasks we start getting "CPU time limit exceeded". we
>>>>>>
>>>>>> Do you notice that in the messages file of SGE on the execution host, or
>>>>>> where do you get the statement?
>>>>>>
>>>>>
>>>>> we get this in our stderr output.
>>>>
>>>> Then I would say it's not a limit by SGE. Can you set up any time limit in
>>>> the application itself?
>>>
>>> not that I am aware of. the application is rendering an image and has
>>> nothing set up to kill it on time.
>>> we do have a limit on memory.
>>>
>>>
>>>>
>>>>
>>>>>>> don't specify a time limit. we do specify h_vmem.
>>>>>>> this only happens on some tasks and not others. even between same tasks
>>>>>>> from a batch on the same machine.
>>>>>>
>>>>>> It could be a limit set in the queue definition (h_cpu), or specified for
>>>>>> some particular jobs (-l h_cpu=...).
>>>>>>
>>>>>> The time for an SGE limit is usually mentioned in the messages file. Is
>>>>>> it always the same time?
>>>>>>
>>>>>
>>>>> 03/13/2012 05:41:24|worker|nano|W|rescheduling job 61607.121
>>>>> 03/13/2012 05:41:24|worker|nano|W|job 61607.131 failed on host louie
>>>>> general rescheduling on application error because: 03/13/2012 05:41:23
>>>>> [0:10105]: exit_status of job start = 100
>>>>
>>>> So, the job was rescheduled (do you know why?), but the restart failed and
>>>> put the job in error status (because of exit code 100). Do you see this?
>>>
>>> to force SGE to error out or retry, we check the exit status of the
>>> task in the prolog. if it is anything other than 0 and it has retries, it
>>> will exit 99 from the prolog. otherwise it exits with 100.
>>> we always have tasks dependent on the output and we don't want them to start.
>>>
>>> could a SIGXCPU
>>
>> Yes, SIGXCPU will generate this error message.
>
> I've put a trap in our run script to catch SIGXCPU and SIGTERM and cause
> it to exit with 100. we were getting jobs killed without good
> cause and their dependencies starting up.
> that's where the 100 comes from then, I guess.
>
> still no idea what could cause the SIGXCPU. could it be sent by
> mem_free or s_vmem?

Yes, it's even sent for s_vmem as a warning (man queue_conf). Did you set s_vmem in addition to h_vmem?

-- Reuti

>>
>> -- Reuti
>>
>>
>>> or a SIGTERM cause this?
>>>
>>>
>>>>
>>>> Can you elaborate in some way on what is going on there in detail - is it
>>>> supposed to fail if it's just rescheduled without cleaning any former
>>>> files or so?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> unless [0:10105] is the limit i'm not sure.
>>>>>
>>>>>
>>>>>
>>>>>> -- Reuti
>>>>
>>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
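
For readers following the thread, the trap in the run script described above could look roughly like the sketch below. It is an illustration only, not the poster's actual script: the helper name run_guarded and its layout are hypothetical, and it assumes a bash-like shell. It relies on the behaviour discussed in the thread, where an exit code of 100 puts the job into the error state so that dependent tasks do not start.

```shell
#!/bin/bash
# Sketch only: run_guarded is a hypothetical helper, not the poster's script.
# It runs a command and turns SIGXCPU or SIGTERM into exit code 100.
run_guarded() (
    child=

    on_signal() {
        echo "caught SIG$1, exiting with 100" >&2
        # Stop the payload if it has already been started
        [ -n "$child" ] && kill -TERM "$child" 2>/dev/null
        exit 100
    }

    trap 'on_signal XCPU' XCPU
    trap 'on_signal TERM' TERM

    # Run the payload in the background so the shell can react to a
    # signal right away instead of only after the payload has finished
    "$@" &
    child=$!

    # Without a signal, the function returns the payload's own exit status
    wait "$child"
)
```

A run script would then call something like run_guarded followed by the render command and its arguments. Since s_vmem also delivers SIGXCPU as a warning, a trap like this would fire for a soft memory limit as well as for a CPU time limit, which may explain the exit code 100 seen in the messages file.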
