On 10.09.2012 at 10:20, Lars van der bijl wrote:

>> <snip>
>> Why do you want to release only certain array tasks? Usually a plain 
>> `qhold`/`qrls` like `qalter` will affect the complete array job, i.e. all 
>> tasks. If for example task 26 of the first job fails, you only want to block 
>> task 26 of job 2 and let all other run?
>> 
> 
> Yes, exactly. But knowing which tasks to unblock would be tricky,
> unless the epilog has information on which task should be unblocked
> in job 2.

As it's your workflow, you could record this information in the job context with
`qsub -ac ...`
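A minimal sketch of that idea (the job ids and the context variable name `failed_task` are assumptions, not anything prescribed by SGE): the epilog of job 1 could note the failed task index in job 2's context via `qalter -ac`:

```shell
# Sketch: record a failed task index in the successor job's context.
# The variable name "failed_task" and the job ids are hypothetical.
record_failed() {
    successor_job=$1
    task_id=$2
    # -ac adds/modifies a context variable; it can be read back later
    # from the "context" line of `qstat -j <job>`
    qalter -ac "failed_task=$task_id" "$successor_job"
}
```

The epilog of job 1 could then call e.g. `record_failed 1235 $SGE_TASK_ID`, and whatever decides about releasing tasks of job 2 reads the context back with `qstat -j 1235`.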


>> Nevertheless, the above commands allow a task range or a single task index 
>> to be given:
>> 
>> $ qrls 1234 -t 1-10
>> $ qrls 1234.42
>> 
>> will release only tasks 1-10 and the others are still on hold.
>> 
>> 
>>>> 
>>>> - Create a special queue for some kind of `enabler' jobs which run forever 
>>>> (loop e.g. once a minute until they quit), the original job will 
>>>> create/touch a special file for which the `enabler' is waiting. If the 
>>>> existence of the relevant file is detected, the `enabler' can release a 
>>>> hold of a certain job or even just submit the successor job.
>>>> 
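
The `enabler' suggested above could be a small poll loop; a sketch (the sentinel path, the successor task id, and the polling interval are all assumptions):

```shell
# Sketch of an `enabler' job: poll for a sentinel file that the
# original job touches on success, then release the hold on the
# successor task. Paths and ids are hypothetical.
wait_and_release() {
    sentinel=$1      # file the original job creates on success
    successor=$2     # held job/task to release, e.g. 1235.26
    interval=${3:-60}
    while [ ! -f "$sentinel" ]; do
        sleep "$interval"
    done
    qrls "$successor"
}
```

An enabler job would then run e.g. `wait_and_release /shared/job1.task26.done 1235.26`.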
>>>> - Creating a workflow can be done with the tool GEL 
>>>> (http://wildfire.bii.a-star.edu.sg/, reference: 
>>>> http://wildfire.bii.a-star.edu.sg/docs/gel_ref.pdf), where you can 
>>>> check for files. But the jobs will be submitted during the workflow and 
>>>> not all in advance. Maybe it is useful anyway.
>>>> 
>>>> -- Reuti
>>> 
>>> It would still be nice to know whether it is possible to implement
>>> the "dormant" task approach. The company I work for would be willing
>>> to pay for such development, depending on the feasibility.
>> 
>> I'm still not sure what you mean by a "dormant" state, as the error state is 
>> not sufficient. Similarly, you can use `qhold 1234.42` and `qmod -rj 1234.42` 
>> to put task 42 back into the waiting state.
>> 
>> In which state should a "dormant" task be?
> 
> If a task errors, that's one thing. Our wrappers catch that, check
> whether the retry limit was hit, and exit with 100.

Good.


> But there are many cases where it errors with 137 or 139 and gets
> removed from the queue, or a task doesn't error but the host
> application spits out corrupt data.

Okay, now I see. You could use a script like:

#!/bin/sh
. /usr/sge/default/common/settings.sh
qalter -h u $JOB_ID    # place a user hold on the job
qmod -rj $JOB_ID       # reschedule the job instead of letting it terminate
kill -9 -- -$1         # kill the complete process group of the job

as the "terminate_method $job_pid" in the queue definition. The task then shows 
up in state "hRq", which would act as your dormant state.
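
Wiring this up would look roughly like the following line in the queue configuration, edited via `qconf -mq <queue>` (the script path is hypothetical):

```shell
terminate_method  /usr/sge/scripts/hold_and_resched.sh $job_pid
```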

-- Reuti


> Instead of removing a task, I'd want to be able to run it again: just
> have it put in a non-active or "dormant" state so that I could run it
> again without having to submit a new set of tasks. We very rarely run
> a single task; they always have dependencies and always have batching,
> so being able to run a subset of tasks again without having to do a
> re-submission would make a huge difference.
> 
> 
>> 
>> -- Reuti
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
