Re: [gridengine users] task exit status problems

Reuti Fri, 07 Sep 2012 15:11:43 -0700

On 07.09.2012 at 23:12, Lars van der bijl wrote:

> On 7 September 2012 19:41, Reuti <[email protected]> wrote:
>> On 07.09.2012 at 18:39, Lars van der bijl wrote:
>> 
>>> On 7 September 2012 17:48, Reuti <[email protected]> wrote:
>>>> On 07.09.2012 at 17:45, Lars van der bijl wrote:
>>>> 
>>>>> On 7 September 2012 17:23, Reuti <[email protected]> wrote:
>>>>>>> would it be possible to change the execd to put any job that does not
>>>>>>> exit with 0 into an error state? regardless of it being a kill -9?
>>>>>> 
>>>>>> You can rerun the job automatically if you exit the epilog with 99.
>>>>> 
>>>>> yes but with 137 or 139 i can't. and as the task hasn't successfully
>>>>> finished i don't want it to start its dependencies. i'd rather it
>>>>> just go to an error state.
>>>> 
>>>> Do you observe that a job being rescheduled by exit 99 triggers its 
>>>> successors (held via -hold_jid) to start?
>>>> 
>>> no when i'm able to raise a 99 exit status it will not trigger its
>>> dependencies. however a task killed with 137 or 139 does.
>>> and I'd rather they error out with 100 than be removed from the
>>> queue altogether.
>>> 
>>> i know that the grid uses 137 when you request a qdel. and this makes
>>> it kinda hard to stop a task if anything else would be put in a 100
>>> error state.
>> 
>> No, the chain of commands is the other way round. The `qdel` will send 
>> sigkill to the job and remove it from the list of jobs in the system 
>> (whatever you do or set in the epilog doesn't matter, as the job is to be 
>> removed by the `qdel`).
>> 
>> You can for example:
>> 
>> - Submit all jobs with a user hold on the successor(s). This user hold 
>> can be removed in the epilog of the predecessor if it ran successfully. The 
>> name/jobid of the successor to be released could be put in a job context, 
>> which you have to read in the epilog and act accordingly.
> 
> I could create this with my database layer however our system relies
> very heavily on batching. so task1 -> task2 with the same batch range
> but with different batch sizes. for example 1-100:25 for task1 and


^^^ job1

> 1-100:1 for task2. how would I be able to find out what the other

^^^ job2

> range is and how would i be able to un-hold that specific range?

Why do you want to release only certain array tasks? Usually a plain 
`qhold`/`qrls`, like `qalter`, will affect the complete array job, i.e. all 
tasks. If, for example, task 26 of the first job fails, do you want to block 
only task 26 of job 2 and let all the others run?

Nevertheless, the above commands allow a task range or a single task index 
to be given:

$ qrls 1234 -t 1-10
$ qrls 1234.42

The first will release only tasks 1-10, while the others stay on hold; the 
second will release only task 42.
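
For the 1-100:25 versus 1-100:1 case, here is a minimal sketch of how an 
epilog could work out which successor tasks belong to one finished task of 
job1. It assumes the task index, step size and last index of the predecessor 
are at hand (SGE exports them to the job as SGE_TASK_ID, SGE_TASK_STEPSIZE 
and SGE_TASK_LAST) and that the successor's job id is known, e.g. from a job 
context. The function names are made up for illustration, and the `qrls` 
line is only printed, not executed:

```shell
#!/bin/sh
# Sketch: map one finished task of job1 (step size N) onto the range of
# job2 tasks (step size 1) that it covers.
# successor_range TASK_ID STEPSIZE LAST  -> prints "first-last"
successor_range() {
    first=$1
    last=$(( $1 + $2 - 1 ))
    # Do not run past the last task of the array.
    if [ "$last" -gt "$3" ]; then
        last=$3
    fi
    echo "${first}-${last}"
}

# release_cmd SUCCESSOR_JID TASK_ID STEPSIZE LAST
# Prints the qrls call instead of running it, so the sketch has no side
# effects. SUCCESSOR_JID is a hypothetical value supplied by the caller.
release_cmd() {
    echo "qrls $1 -t $(successor_range "$2" "$3" "$4")"
}
```

With job1 submitted as 1-100:25, its tasks are 1, 26, 51 and 76, so 
`release_cmd 1234 26 25 100` would print the release of tasks 26-50 of a 
successor with the assumed job id 1234.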


>> 
>> - Create a special queue for some kind of `enabler' jobs which run forever 
>> (loop e.g. once a minute until they quit), the original job will 
>> create/touch a special file for which the `enabler' is waiting. If the 
>> existence of the relevant file is detected, the `enabler' can release a hold 
>> of a certain job or even just submit the successor job.
>> 
>> - Creating a workflow can be done with: http://wildfire.bii.a-star.edu.sg/ 
>> tool GEL http://wildfire.bii.a-star.edu.sg/docs/gel_ref.pdf where you can 
>> check for files. But the jobs will be submitted during the workflow and not 
>> all in advance. Maybe it is useful anyway.
>> 
>> -- Reuti
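
The `enabler' idea quoted above could look roughly like the following. This 
is only a sketch under assumptions: the flag file path, the polling interval 
and the successor's job id are all hypothetical, and the release command is 
passed in as arguments so that it can be swapped for `echo` in a dry run.

```shell
#!/bin/sh
# Sketch of an `enabler' loop: poll for a flag file, then run a command.
# wait_and_release FLAG_FILE INTERVAL MAX_TRIES COMMAND [ARGS...]
wait_and_release() {
    flag=$1
    interval=$2
    tries=$3
    shift 3
    while [ "$tries" -gt 0 ]; do
        if [ -e "$flag" ]; then
            # The flag file exists: run the release command, e.g. qrls 1234
            "$@"
            return 0
        fi
        tries=$(( tries - 1 ))
        sleep "$interval"
    done
    # The flag file never appeared within MAX_TRIES polls.
    return 1
}
```

An enabler job could then call, for example, 
`wait_and_release /some/shared/dir/job1.done 60 1440 qrls 1234`, where both 
the path and the job id are placeholders.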
> 
> it would still be nice to know whether it would be possible to implement
> the "dormant" task approach. the company I work for would be willing
> to pay for such development, depending on the feasibility.

I'm still not sure what you mean by a "dormant" state, since the error state 
is apparently not sufficient. Similarly, you can use `qhold 1234.42` and 
`qmod -rj 1234.42` to put task 42 back into the waiting state.

In which state should a "dormant" task be?

-- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
