On 7 September 2012 15:55, Lars van der bijl <[email protected]> wrote:
> Hey everyone,
>
> We have been using the grid for VFX for a few years and our job
> dependencies have grown a lot. A "job" is a collection of tasks. All
> our tasks have "batches" so we regularly run a job of 50 tasks with
> 2000+ batches.
>
> Very often a batch dies for reasons such as memory limits, seg fault,
> or a kill command; getting a 139 or a 137 is also very common.
The 137 is a SIGKILL (128 + 9). Using the s_* limits rather than the
h_* ones would possibly give you a warning signal first. Other
job-killing signals are catchable (the 139 is a SIGSEGV, 128 + 11), so
you should be able to deal with them by having the job itself exit with
an appropriate status (use a wrapper script if necessary). If for some
reason you can't or don't want to modify the job script, then a
starter_method could presumably handle this.
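
A rough sketch of such a wrapper, in Python. Nothing here comes from
the original job setup: the names run, map_status and describe are made
up for illustration, the list of forwarded signals is a guess at which
warning signals the soft limits deliver, and 100 is assumed to be the
status you want the scheduler to see. A SIGKILL sent to the wrapper
itself still cannot be caught.

```python
#!/usr/bin/env python3
"""Wrapper sketch: run the real command and report one failure status."""
import signal
import subprocess
import sys

# Assumption: 100 is the exit status that should leave the job in an
# error state instead of letting it drop out of the queue.
ERROR_STATE_STATUS = 100


def describe(returncode):
    """Human-readable reason for a child return code (for the job log)."""
    if returncode < 0:        # subprocess reports death by signal N as -N
        signum = -returncode
    elif returncode > 128:    # a shell reports death by signal N as 128 + N
        signum = returncode - 128
    else:
        return "exit status %d" % returncode
    try:
        return "killed by " + signal.Signals(signum).name
    except ValueError:
        return "killed by signal %d" % signum


def map_status(returncode):
    """Collapse every non-zero outcome into ERROR_STATE_STATUS."""
    return 0 if returncode == 0 else ERROR_STATE_STATUS


def run(argv):
    """Run argv as a child, forward warning signals, return the mapped status."""
    child = subprocess.Popen(argv)

    def forward(signum, _frame):
        # Pass the signal on so the child gets a chance to shut down cleanly.
        if child.poll() is None:
            child.send_signal(signum)

    # Assumed set of catchable signals; adjust to what your limits send.
    for sig in (signal.SIGUSR1, signal.SIGUSR2, signal.SIGXCPU, signal.SIGTERM):
        signal.signal(sig, forward)

    returncode = child.wait()
    if returncode != 0:
        sys.stderr.write("wrapper: child %s\n" % describe(returncode))
    return map_status(returncode)


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run(sys.argv[1:]))
```

You would submit the wrapper with the real command as its arguments, or
point a starter_method at something similar if the job scripts cannot
be touched.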


> This has the nasty side effect of the task being removed from the
> queue completely and raising a 100 in the epilog won't help.
>
> Also, rescheduling a task often doesn't kill the task on the original
> host, causing the first host to corrupt the second host's output.
>
> My question is: how difficult would it be to get a task not to be
> removed from the queue but be placed in a "dormant" state, so that it
> can be re-activated for another run?
>
> Would it be possible to change the execd to put any job that does not
> exit with 0 into an error state, regardless of it being a kill -9?
>
> greetings,
>
> Lars
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
>
>
