On 13 March 2012 13:55, Reuti <[email protected]> wrote:
> On 13.03.2012 at 12:46, Lars van der bijl wrote:
>
>> On 13 March 2012 12:32, Reuti <[email protected]> wrote:
>>> On 13.03.2012 at 12:03, Lars van der bijl wrote:
>>>
>>>> On 13 March 2012 11:18, Reuti <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> On 13.03.2012 at 10:59, Lars van der bijl wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> We're having the following problem.
>>>>>>
>>>>>> Randomly, on some tasks, we start getting "CPU time limit exceeded". We
>>>>>
>>>>> Do you see that in the SGE messages file on the execution host, or
>>>>> where do you get the message?
>>>>>
>>>>
>>>> we get this in our stderr output.
>>>
>>> Then I would say it's not a limit by SGE. Can you set up any time limit in 
>>> the application itself?
>>
>> Not that I am aware of. The application renders an image and has
>> nothing set up to kill it after a time limit.
>> We do have a limit on memory.
>>
>>
>>>
>>>
>>>>>> don't specify a time limit. We do specify h_vmem.
>>>>>> This only happens on some tasks and not others, even between
>>>>>> identical tasks from a batch on the same machine.
>>>>>
>>>>> It could be a limit set in the queue definition (h_cpu), or one
>>>>> specified for particular jobs (-l h_cpu=...).
>>>>>
>>>>> The time for an SGE limit is usually mentioned in the messages file. Is 
>>>>> it always the same time?
>>>>>
>>>>
>>>> 03/13/2012 05:41:24|worker|nano|W|rescheduling job 61607.121
>>>> 03/13/2012 05:41:24|worker|nano|W|job 61607.131 failed on host louie
>>>> general rescheduling on application error because: 03/13/2012 05:41:23
>>>> [0:10105]: exit_status of job start = 100
>>>
>>> So, the job was rescheduled (do you know why?), but the restart failed and 
>>> put the job in error status (because of exit code 100). Do you see this?
>>
>> To force SGE to error out or retry, we check the exit status of the
>> task in the prolog. If it is anything other than 0 and it has retries
>> left, the prolog exits with 99; otherwise it exits with 100.
>> We always have tasks that depend on the output and we don't want them
>> to start.
>>
>> could a SIGXCPU
>
> Yes, SIGXCPU will generate this error message.

I've put a trap in our run script to catch SIGXCPU and SIGTERM and make
it exit with 100. We were getting jobs killed without good cause, which
then started their dependencies.
That's where the 100 comes from then, I guess.
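
A minimal sketch of the kind of trap described above, assuming a
POSIX-style shell wrapper. The names run_task and on_signal are
hypothetical and not taken from the actual run script; only the signals
(SIGXCPU, SIGTERM) and the exit code 100 come from this thread.

```shell
#!/bin/sh
# Hypothetical wrapper sketch; function and variable names are illustrative.

# Exit code that, per this thread, should put the job into error state.
ERROR_EXIT=100

on_signal() {
    # $1 is the name of the signal that fired.
    echo "caught SIG$1, exiting with $ERROR_EXIT" >&2
    # Stop the task if it is still running; ignore failure if it is gone.
    if [ -n "$child" ]; then
        kill "$child" 2>/dev/null || true
    fi
    exit "$ERROR_EXIT"
}

run_task() {
    trap 'on_signal XCPU' XCPU
    trap 'on_signal TERM' TERM
    # Run the task in the background and wait for it, so that a trapped
    # signal interrupts the wait instead of being deferred until the
    # task has finished.
    "$@" &
    child=$!
    wait "$child"
}
```

Called as `run_task render_command args...` (render_command being a
placeholder), the wrapper returns the task's own exit status when no
signal arrives, and exits with 100 when SIGXCPU or SIGTERM is caught.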

Still no idea what could cause the SIGXCPU. Could it be sent by
mem_free or s_vmem?
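
One generic thing that can be checked from inside the job: the kernel
delivers SIGXCPU when a process exceeds its soft CPU-time resource limit
(RLIMIT_CPU), so printing that limit at the top of the run script would
show whether one is in effect. A small sketch, assuming a shell whose
ulimit builtin supports -t (bash and dash do); cpu_limit_report is a
hypothetical helper name.

```shell
# Hypothetical helper: report the soft CPU-time limit seen by this shell.
cpu_limit_report() {
    # ulimit -S -t prints the soft limit in seconds, or "unlimited".
    limit=$(ulimit -S -t)
    if [ "$limit" = "unlimited" ]; then
        echo "no soft CPU-time limit set"
    else
        echo "soft CPU-time limit: ${limit}s"
    fi
}
```

Comparing this output between a task that fails and one that succeeds on
the same machine might help narrow down where the limit comes from.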

>
> -- Reuti
>
>
>> or a SIGTERM cause this?
>>
>>
>>>
>>> Can you elaborate in some way on what is going on there in detail? Is it
>>> supposed to fail if it's just rescheduled without cleaning up any earlier
>>> files?
>>>
>>> -- Reuti
>>>
>>>
>>>> Unless [0:10105] is the limit, I'm not sure.
>>>>
>>>>
>>>>
>>>>> -- Reuti
>>>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
