Yes, as I said in one of my previous messages, I was wrong about this:
the job only dies if you explicitly send a SIGKILL/INT/TERM to the qrsh
process, so this is a non-issue.
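
For the clean shutdown part, here is a minimal sketch of the "own protocol" approach Reuti suggests further down the thread: instead of signalling the qrsh processes, the master drops a stop file and each worker exits on its own once it sees it. This is only an illustration, not the scripts from the gist. The function names, the stop-file path and the poll interval are hypothetical, and it assumes a directory visible to both the master and the worker hosts (e.g. NFS).

```shell
#!/bin/sh
# Sketch of a stop-file shutdown protocol (hypothetical names throughout).
# Assumes the stop file lives in a directory shared by master and workers.

# Master side: ask the workers to finish instead of signalling qrsh.
# $1 = path of the stop file
request_stop() {
    stop_file=$1
    : > "$stop_file"
}

# Worker side: run one unit of work per iteration until the stop file exists.
# $1 = path of the stop file
# $2 = command performing one unit of work
# $3 = seconds to sleep between iterations (defaults to 1)
worker_loop() {
    stop_file=$1
    work_cmd=$2
    interval=${3:-1}
    while [ ! -e "$stop_file" ]; do
        $work_cmd
        sleep "$interval"
    done
    return 0
}

# In master.sh this might be used roughly as follows (not executed here):
#   qrsh -inherit "$host" /path/to/worker.sh "$STOP_FILE" &
#   ... do the master's own work ...
#   request_stop "$STOP_FILE"
#   wait    # each qrsh returns by itself once its worker has exited
```

Since no signal ever reaches qrsh with this scheme, the task failure detection should not be triggered by a normal shutdown.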

Thanks for your help.

2012/8/28 Reuti <[email protected]>

> On 28.08.2012 at 19:20, Julien Nicoulaud wrote:
>
> > Yes, exactly that.
>
> So in total you have two issues:
>
> - clean shutdown
> - proper handling in case a worker.sh crashes
>
> I don't get the problem with the second case. The `qrsh -inherit ...` will
> return and that's all.
>
> How is the task failure generated in your environment and what do you
> observe?
>
> -- Reuti
>
>
> > 2012/8/28 Reuti <[email protected]>
> > On 28.08.2012 at 11:48, Julien Nicoulaud wrote:
> >
> > > The FORBID_APPERROR parameter seems to be specific to applications
> > > returning 100.
> > >
> > > My concern was about a random slave process crashing in the middle of
> > > the run,
> >
> > Is your application fault-tolerant in such a way that the other
> > processes discover that one worker.sh has crashed and compensate for
> > this failure?
> >
> > -- Reuti
> >
> >
> > > but I realized after some testing that you really have to explicitly
> > > send a signal to the qrsh process to trigger task failure detection.
> > >
> > > Thanks for your help!
> > >
> > > 2012/8/28 William Hay <[email protected]>
> > > On 27 August 2012 17:20, Julien Nicoulaud <[email protected]> wrote:
> > > > Thanks for your answer, I'll deal with the clean shutdown in my
> > > > application.
> > > >
> > > > However, do you know whether this task failure detection can be
> > > > disabled? It is acceptable for me to have one worker process
> > > > crashing, but not if it kills the whole job as a side effect...
> > >
> > > For a similar but not identical problem we found setting
> > > FORBID_APPERROR=true in qmaster_params prevented gratuitous jobkills
> > > when a subtask of a job finished.
> > >
> > > William
> > >
> > > >
> > > >
> > > > 2012/8/26 Reuti <[email protected]>
> > > >>
> > > >> Hi,
> > > >>
> > > > >> On 26.08.2012 at 15:42, Julien Nicoulaud wrote:
> > > >>
> > > > >> > I'm working on setting up a tightly integrated parallel
> > > > >> > environment for my application using the "qrsh -inherit" method,
> > > > >> > but I can't find the right way to terminate the qrsh sub-tasks.
> > > > >> > Whatever method I try, the parent job always ends with
> > > > >> > "Unable to run job N"
> > > >>
> > > > >> You will get this message only if you start it with `-sync y`. It
> > > > >> won't be in any logfile otherwise. But I don't face the issue that
> > > > >> the workers run forever. They are killed by the exit of the
> > > > >> complete job, although not in a nice way but by a `kill`.
> > > > >>
> > > > >> Maybe you can set in `qconf -mconf`: "execd_params
> > > > >> ENABLE_ADDGRP_KILL=TRUE"
> > > >>
> > > >> ==
> > > >>
> > > > >> The usual way to shut down slave tasks: use your own protocol
> > > > >> which you want to implement and tell your worker.sh this way:
> > > > >> "Hey, kill yourself."
> > > >>
> > > >> ==
> > > >>
> > > > >> In principle it's supported to handle signals, and the sge_execd
> > > > >> can tell the sge_shepherd to signal its kids. For a "normal"
> > > > >> binary you can implement actions to handle it in a proper way.
> > > > >> Using the tight integration by `qrsh -inherit ...` there is the
> > > > >> special situation that the "qrsh_starter" will also get the
> > > > >> signal, and it will just exit, forcing the job to end.
> > > >>
> > > >> -- Reuti
> > > >>
> > > >>
> > > >> > message and the qmaster log contains:
> > > >> >
> > > > >> > tightly integrated parallel task 159.1 task 1.vbox-centos6-3 failed - killing job
> > > >> >
> > > >> > Does anyone know the right way to handle this ?
> > > >> >
> > > >> > If this can help, I shared my test scripts here:
> > > >> > https://gist.github.com/3479264
> > > >> >       • test.sh: submits master.sh as a N slots parallel job
> > > >> >       • master.sh:
> > > >> >               • Launches N-1 worker.sh with "qrsh -inherit" in the
> > > >> > background
> > > >> >               • Works for a while
> > > >> >               • Sends TERM to qrsh processes
> > > >> >       • worker.sh: works until killed
> > > >> > By the way, I'm using SGE 6.2u5.
> > > >> >
> > > >> > Any help on this is welcome!
> > > >> >
> > > >> > Regards,
> > > >> > Julien
> > > >> > _______________________________________________
> > > >> > users mailing list
> > > >> > [email protected]
> > > >> > https://gridengine.org/mailman/listinfo/users
> > > >>
> > > >
> > >
> >
> >
>
>