On 07.09.2012 at 23:12, Lars van der bijl wrote:

> On 7 September 2012 19:41, Reuti <[email protected]> wrote:
>> On 07.09.2012 at 18:39, Lars van der bijl wrote:
>>
>>> On 7 September 2012 17:48, Reuti <[email protected]> wrote:
>>>> On 07.09.2012 at 17:45, Lars van der bijl wrote:
>>>>
>>>>> On 7 September 2012 17:23, Reuti <[email protected]> wrote:
>>>>>>> would it be possible to change the execd to put any job that does not
>>>>>>> exit with 0 into an error state? regardless of it being a kill -9?
>>>>>>
>>>>>> You can rerun the job automatically if you exit the epilog with 99.
>>>>>
>>>>> yes but with 137 or 139 i can't. and as the task hasn't successfully
>>>>> finished i don't want it to start it's dependencies. i'd rather it
>>>>> just go to a error state.
>>>>
>>>> You observe, that a job being rescheduled by exit 99 will trigger its
>>>> successors by -hold_jid to start?
>>>>
>>> no when i'm able to raise a 99 exit status it will not trigger it's
>>> dependencies. however a task killed because of 137 or 139 do.
>>> and I'd rather them error out with 100 them to be removed from the
>>> queue all together.
>>>
>>> i know that the grid uses 137 when you request a qdel. and this makes
>>> it kinda hard to stop a task if anything else would be put in a 100
>>> error state.
>>
>> No, the chain of commands is the other way round. The `qdel` will send
>> sigkill to the job and remove it from the list of jobs in the system
>> (whatever you do or set in the epilog doesn't matter, as the job is to be
>> removed by the `qdel`).
>>
>> You can for example:
>>
>> - Submit all jobs with a user hold of the successor(s), this user hold you
>> can be removed in the epilog of the predecessor if it ran successful. The
>> name/jobid of the successor to be released could be put in a job context
>> which you have to read in the epilog and act accordingly.
>
> I could create this with my database layer however our system relies
> very heavily on batching.
> so task1 -> task2 with the same batch range
> but with different batch sizes. for example 1-100:25 for task1 and
                                                           ^^^ job1
> 1-100:1 for task2. how would I be able to find out what the other
              ^^^ job2
> range is and how would i be able to un-hold that specific range?

Why do you want to release only certain array tasks? Usually a plain `qhold`/`qrls`, like `qalter`, will affect the complete array job, i.e. all tasks. If, for example, task 26 of the first job fails, do you want to block only task 26 of job 2 and let all the others run?

Nevertheless, the above commands allow a task range or a single task index to be given:

$ qrls 1234 -t 1-10
$ qrls 1234.42

The first releases only tasks 1-10 while the others stay on hold; the second releases only task 42.

>>
>> - Create a special queue for some kind of `enabler' jobs which run forever
>> (loop e.g. once a minute until they quit), the original job will
>> create/touch a special file for which the `enabler' is waiting. If the
>> existence of the relevant file is detected, the `enabler' can release a hold
>> of a certain job or even just submit the successor job.
>>
>> - Creating a workflow can be done with: http://wildfire.bii.a-star.edu.sg/
>> tool GEL http://wildfire.bii.a-star.edu.sg/docs/gel_ref.pdf where you can
>> check for files. But the jobs will be submitted during the workflow and not
>> all in advance. Maybe it is useful anyway.
>>
>> -- Reuti
>
> it would still be nice to know if it where possible to know implement
> the "dormant" task approach. the company I work for would be willing
> to pay for such development. depending on the feasibility.

I'm still not sure what you mean by a "dormant" state, given that the error state is not sufficient. Similarly, you can use `qhold 1234.42` and `qmod -rj 1234.42` to put task 42 back into the waiting state. In which state should a "dormant" task be?

-- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
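
As a rough illustration of the range question discussed above (a predecessor submitted as 1-100:25 and a successor as 1-100:1), the matching successor range can be derived from the predecessor's task index and step size. The following is only a sketch under assumptions: `release_cmd` and its arguments are hypothetical names, the successor is assumed to sit on a user hold, and the function merely prints the `qrls` call instead of running it.

```shell
#!/bin/sh
# Sketch only: map one finished chunk of the predecessor array job onto the
# matching task range of the successor and print the qrls call (dry run).
# release_cmd and its arguments are hypothetical names, not part of Grid Engine.
release_cmd() {
    succ_job=$1   # job id of the successor array job (assumed to be on user hold)
    first=$2      # first index covered by the finished predecessor task
    step=$3       # step size of the predecessor, e.g. 25 for 1-100:25
    last=$4       # last index of the array range, e.g. 100

    end=$((first + step - 1))
    if [ "$end" -gt "$last" ]; then
        end=$last             # clamp the final, possibly shorter chunk
    fi
    echo "qrls $succ_job -t $first-$end"
}

# Task 26 of a 1-100:25 job covers indices 26-50 of the 1-100:1 successor.
release_cmd 1234 26 25 100
```

Inside a job script the three numbers could probably be taken from `$SGE_TASK_ID`, `$SGE_TASK_STEPSIZE` and `$SGE_TASK_LAST`; whether these are also visible in the epilog is an assumption that should be checked on the actual installation. The successor's job id (1234 here) would still have to come from somewhere, e.g. the job context mentioned above.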
