The FORBID_APPERROR parameter seems to be specific to applications
returning exit code 100.
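For reference, William's FORBID_APPERROR suggestion below is a
qmaster_params entry in the global cluster configuration. A sketch of how
it would be applied (check sge_conf(5) for your SGE version):

```
# Open the global configuration for editing:
#   qconf -mconf
# then add (or extend) the qmaster_params line:
qmaster_params   FORBID_APPERROR=TRUE
```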

My concern was about a random slave process crashing in the middle of the
run, but after some testing I realize you really have to explicitly send a
signal to the qrsh process to trigger task failure detection.

Thanks for your help!

2012/8/28 William Hay <[email protected]>

> On 27 August 2012 17:20, Julien Nicoulaud <[email protected]>
> wrote:
> > Thanks for your answer, I'll deal with the clean shutdown in my
> > application.
> >
> > However, do you know whether this task failure detection can be
> > disabled? It is acceptable for me to have one worker process crashing,
> > but not if it kills the whole job as a side effect...
>
> For a similar but not identical problem we found that setting
> FORBID_APPERROR=true in qmaster_params prevented gratuitous job kills
> when a subtask of a job finished.
>
> William
>
> >
> >
> > 2012/8/26 Reuti <[email protected]>
> >>
> >> Hi,
> >>
> >> Am 26.08.2012 um 15:42 schrieb Julien Nicoulaud:
> >>
> >> > I'm working on setting up a tightly integrated parallel environment
> >> > for my application using the "qrsh -inherit" method, but I can't
> >> > find the right way to terminate the qrsh sub-tasks. Whatever method
> >> > I try, the parent job always ends with an "Unable to run job N"
> >>
> >> You will get this message only if you start it with `-sync y`; it
> >> won't appear in any logfile otherwise. But I don't see the issue of
> >> the workers running forever: they are killed at the exit of the
> >> complete job, although not in a nice way but by a `kill`.
> >>
> >> Maybe you can set in `qconf -mconf`:
> >> "execd_params ENABLE_ADDGRP_KILL=TRUE"
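For completeness, this execd_params suggestion is applied the same way as
other global configuration entries (a sketch; see sge_conf(5)):

```
# qconf -mconf  (global configuration; execd_params line)
execd_params   ENABLE_ADDGRP_KILL=TRUE
```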
> >>
> >> ==
> >>
> >> The usual way to shut down slave tasks is to implement your own
> >> protocol and tell your worker.sh: "Hey, kill yourself."
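The "kill yourself" protocol above can be as simple as a sentinel file on
a shared filesystem that workers poll. A minimal self-contained sketch
(the file name and timings are illustrative; a real worker.sh would
replace the sleep with actual work):

```shell
#!/bin/sh
# Hypothetical stop-file protocol: the master creates $STOP_FILE,
# each worker polls for it and exits 0 instead of being killed.
STOP_FILE="/tmp/demo.$$.stop"

# "worker": loop until the stop file appears, then exit cleanly
(
    while [ ! -e "$STOP_FILE" ]; do
        sleep 1    # real work would go here
    done
    exit 0
) &
WORKER=$!

sleep 1             # "master" works for a while...
: > "$STOP_FILE"    # ...then asks the worker to shut itself down
wait "$WORKER"      # worker exits 0 of its own accord
STATUS=$?
rm -f "$STOP_FILE"
echo "worker exit status: $STATUS"
```

Because the worker exits 0 by itself, the corresponding `qrsh -inherit`
call returns success and no task failure should be detected.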
> >>
> >> ==
> >>
> >> In principle signal handling is supported: the sge_execd can tell the
> >> sge_shepherd to signal its children, and a "normal" binary can catch
> >> the signal and handle it in a proper way. With tight integration via
> >> `qrsh -inherit ...` there is a special situation: the "qrsh_starter"
> >> will also get the signal, and it will just exit, forcing the job to
> >> end.
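If you do want to signal the workers, having worker.sh trap TERM and turn
it into a clean exit is one option. A self-contained sketch (whether this
survives the qrsh_starter behaviour Reuti describes needs testing in your
own setup):

```shell
#!/bin/sh
# A "worker" that traps SIGTERM and converts it into a clean exit 0.
(
    trap 'exit 0' TERM      # clean shutdown instead of death-by-signal
    while :; do
        sleep 1 &
        wait $!             # wait is interruptible, so the trap fires promptly
    done
) &
WORKER=$!

sleep 1                     # let the worker install its trap
kill -TERM "$WORKER"        # master tells the worker to stop
wait "$WORKER"
STATUS=$?
echo "worker exit status: $STATUS"
```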
> >>
> >> -- Reuti
> >>
> >>
> >> > message and the qmaster log contains:
> >> >
> >> > tightly integrated parallel task 159.1 task 1.vbox-centos6-3 failed -
> >> > killing job
> >> >
> >> > Does anyone know the right way to handle this?
> >> >
> >> > If this can help, I shared my test scripts here:
> >> > https://gist.github.com/3479264
> >> >       • test.sh: submits master.sh as an N-slot parallel job
> >> >       • master.sh:
> >> >               • launches N-1 worker.sh tasks with "qrsh -inherit"
> >> >                 in the background
> >> >               • works for a while
> >> >               • sends TERM to the qrsh processes
> >> >       • worker.sh: works until killed
> >> > By the way, I'm using SGE 6.2u5.
> >> >
> >> > Any help on this is welcome!
> >> >
> >> > Regards,
> >> > Julien
> >> > _______________________________________________
> >> > users mailing list
> >> > [email protected]
> >> > https://gridengine.org/mailman/listinfo/users
> >>
> >
>