On 7 September 2012 15:55, Lars van der bijl <[email protected]> wrote:
> Hey everyone,
>
> We have been using the grid for VFX for a few years and our job
> dependencies have grown a lot. A "job" is a collection of tasks. All
> our tasks have "batches" so we regularly run a job of 50 tasks with
> 2000+ batches.
>
> Very often a batch dies for reasons such as memory limits, seg fault,
> or a kill command; getting a 139 or a 137 is also very common.

The 137 is a SIGKILL (128 + 9). Possibly using the s_* (soft) limits rather than the h_* (hard) ones would give you a warning signal first. The other job-killing signals are catchable, so you should be able to deal with them by having the job itself exit with an appropriate status (use a wrapper script if necessary). If for some reason you can't or don't want to modify the job script, then a starter_method could presumably handle this.
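
A minimal sketch of such a wrapper, in Python, in case it helps. The names run_wrapped and map_exit_status are made up for illustration. It assumes that a job exiting with status 100 is put into the error state rather than dropped, and that the soft-limit warning arrives as SIGUSR1 or SIGXCPU. Check both against your own configuration. It cannot help if the wrapper itself receives SIGKILL, since that signal is not catchable.

```python
import signal
import subprocess
import sys

# Assumption: exit status 100 makes the scheduler hold the job in error state.
ERROR_STATE_EXIT = 100


def map_exit_status(returncode):
    """Translate the child's return code into the wrapper's exit status."""
    if returncode == 0:
        return 0
    if returncode < 0 or returncode > 128:
        # Negative: subprocess reports death by signal as -signum.
        # Above 128: a shell reports it as 128 + signum (137, 139).
        return ERROR_STATE_EXIT
    return returncode


def run_wrapped(cmd, warn_signals=(signal.SIGUSR1, signal.SIGXCPU, signal.SIGTERM)):
    """Run cmd, stop it cleanly on a warning signal, return a mapped status."""
    child = subprocess.Popen(cmd)
    caught = []

    def on_warning(signum, frame):
        # Remember the signal and ask the child to stop.
        caught.append(signum)
        child.terminate()

    # Install the handlers, remembering the old ones so they can be restored.
    previous = {s: signal.signal(s, on_warning) for s in warn_signals}
    try:
        returncode = child.wait()
    finally:
        for s, handler in previous.items():
            signal.signal(s, handler)

    if caught:
        return ERROR_STATE_EXIT
    return map_exit_status(returncode)


# Typical use from a job script (hypothetical):
#     sys.exit(run_wrapped(sys.argv[1:]))
```

With this, a batch that segfaults (139 from a shell, or -11 as seen by subprocess) or is killed with -9 makes the wrapper exit with 100, while ordinary non-zero exits are passed through unchanged.
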
> This has the nasty side effect of the task being removed from the
> queue completely and raising a 100 in the epilog won't help.
>
> Also, rescheduling a task often doesn't kill the task on the original
> host, causing the first host to corrupt the second host's output.
>
> my question is how difficult would it be to get a task not to be
> removed from the queue but be placed in a "dormant" state, so that it
> can be re-activated for another run?
>
> would it be possible to change the execd to put any job that does not
> exit with 0 into an error state? regardless of it being a kill -9?
>
> greetings,
>
> Lars
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
