On 27.08.2012 at 18:20, Julien Nicoulaud wrote:

> Thanks for your answer, I'll deal with the clean shutdown in my application.
>
> However, do you know whether this task failure detection can be disabled? It
> is acceptable for me to have one worker process crashing, but not if it kills
> the whole job as a side effect...

But the job is over anyway, if I understand you correctly. Or do you want to quit the slaves and continue with the master task?

If you want to quit all slave tasks: maybe create a file <jobid>.stop; the slaves can detect its presence and then quit themselves.

-- Reuti

>
> 2012/8/26 Reuti <[email protected]>
> Hi,
>
> On 26.08.2012 at 15:42, Julien Nicoulaud wrote:
>
> > I'm working on setting up a tightly integrated parallel environment for my
> > application using the "qrsh -inherit" method, but I can't find the right
> > way to terminate the qrsh sub-tasks. Whatever method I try, the parent job
> > always ends with "Unable to run job N"
>
> You will get this message only if you start it with `-sync y`. It won't be in
> any logfile otherwise. But I don't face the issue that the workers run
> forever. They are killed by the exit of the complete job, although not in a
> nice way but by a `kill`.
>
> Maybe you can set in `qconf -mconf`: "execd_params
> ENABLE_ADDGRP_KILL=TRUE"
>
> ==
>
> The usual way to shut down slave tasks: use your own protocol which you want
> to implement and tell your worker.sh this way: "Hey, kill yourself."
>
> ==
>
> In principle it's supported to handle signals and the sge_execd can tell the
> sge_shepherd to signal its kids. For a "normal" binary you can implement
> actions to handle it in a proper way. Using the tight integration by `qrsh
> -inherit ...` there is the special situation that the "qrsh_starter" will
> also get the signal and will just exit, forcing the job to end.
>
> -- Reuti
>
>
> > message and the qmaster log contains:
> >
> > tightly integrated parallel task 159.1 task 1.vbox-centos6-3 failed -
> > killing job
> >
> > Does anyone know the right way to handle this?
> >
> > If this can help, I shared my test scripts here:
> > https://gist.github.com/3479264
> >   • test.sh: submits master.sh as an N-slot parallel job
> >   • master.sh:
> >       • Launches N-1 worker.sh with "qrsh -inherit" in the
> >         background
> >       • Works for a while
> >       • Sends TERM to qrsh processes
> >   • worker.sh: works until killed
> > By the way, I'm using SGE 6.2u5.
> >
> > Any help on this is welcome!
> >
> > Regards,
> > Julien
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
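
The <jobid>.stop idea from the reply above could be sketched roughly as follows. This is only an illustration, not the code from the gist: the function names and the STOP_DIR variable are hypothetical, and it assumes STOP_DIR points to a directory that the master and all workers can see (for example a shared working directory). The job id is passed in explicitly as an argument.

```shell
#!/bin/sh
# Sketch of a stop-file protocol. All names here (STOP_DIR, stop_file,
# request_stop, stop_requested, worker_loop) are hypothetical.

# Directory assumed to be visible to the master and to every worker.
STOP_DIR="${STOP_DIR:-$PWD}"

# Print the path of the stop file for the job id given as $1.
stop_file() {
    printf '%s/%s.stop\n' "$STOP_DIR" "$1"
}

# Master side: ask all workers of job $1 to quit by creating the file.
request_stop() {
    : > "$(stop_file "$1")"
}

# Worker side: succeeds once a stop has been requested for job $1.
stop_requested() {
    [ -e "$(stop_file "$1")" ]
}

# Worker main loop: repeat one unit of work until the stop file appears.
# $2 is an optional poll interval in seconds (default 1).
worker_loop() {
    while ! stop_requested "$1"; do
        # ... one unit of real work would go here ...
        sleep "${2:-1}"
    done
}
```

With something like this, master.sh would call `request_stop "$JOB_ID"` instead of sending TERM to the qrsh processes, then `wait` for them and remove the stop file afterwards. Each worker.sh would run `worker_loop` with the same job id and exit with status 0 on its own, so the qrsh processes are never signalled.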
