Thanks for your answers. @Dave: the crashes seem to occur on RHEL/CentOS 5. I'm using CentOS 6 and so far I haven't had any crashes.
The linked changeset seems to indicate that there is no way to get this working properly on 6.2u5, but I'm still wondering: if this is governed by some regular polling, what determines its period? Maybe it's configurable. I can afford more load on the qmaster host.

2012/9/3 Reuti <[email protected]>
> Hi,
>
> On 03.09.2012 at 17:11, Julien Nicoulaud wrote:
>
> > I'm using SGE 6.2u5. I built a tightly integrated parallel environment for my application, using "qrsh -inherit". Everything works fine, but at the end of every job using the PE, there is a long delay (approx. 2 minutes) between the moment the PE script returns and the moment the parent qsub returns.
> >
> > The only case where it returns fast is when I send a SIGINT to the parent qsub. In every other configuration, there is this delay. It happens regardless of the result of the PE script, and regardless of whether the qrsh processes are cleanly shut down before returning.
> >
> > My PE does not have any stop_proc_args, and my queue has no epilog.
> >
> > I can't find any relevant trace of what happens in the meantime in the logs.
> >
> > Is this normal behavior?
>
> Yes.
>
> > Is there some kind of polling mechanism?
>
> It's a safety precaution to be sure all tasks have ended.
>
> IIRC it's fixed in some later versions of SGE, but I can't find any links about it.
>
> -- Reuti
>
> > Regards,
> > Julien
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
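For context, the tightly integrated setup described in the quoted message starts its remote tasks with `qrsh -inherit`. Below is a minimal Python sketch of such a launcher, assuming that `qrsh` is on the PATH, that the script runs inside the PE job (so `$PE_HOSTFILE` is set), and that one task per host is enough. The function names and the script itself are hypothetical, not the actual application code. It waits for every `qrsh -inherit` child before exiting, i.e. the "cleanly shut down" case, so on its own it does not avoid the ~2 minute delay.

```python
import os
import subprocess
import sys


def parse_pe_hostfile(text):
    """Return (host, slots) pairs from the contents of $PE_HOSTFILE.

    Each line is expected to look like "host slots queue processor_range";
    only the first two fields are used here.
    """
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        hosts.append((fields[0], int(fields[1])))
    return hosts


def launch_tasks(hosts, command):
    """Start one `qrsh -inherit` task per host and wait for all of them."""
    procs = [
        subprocess.Popen(["qrsh", "-inherit", "-nostdin", host] + list(command))
        for host, _slots in hosts
    ]
    # Wait for every remote task so none is left running when the script exits.
    return [p.wait() for p in procs]


def main(argv):
    with open(os.environ["PE_HOSTFILE"]) as f:
        hosts = parse_pe_hostfile(f.read())
    codes = launch_tasks(hosts, argv)
    # Report failure if any remote task exited non-zero or was killed.
    return 0 if all(c == 0 for c in codes) else 1


# Only act when run inside an SGE parallel job with a command to launch.
if __name__ == "__main__" and "PE_HOSTFILE" in os.environ and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

Inside the job script it would be called as something like `python launch_pe.py ./my_app --some-arg`, where `launch_pe.py` and `my_app` are placeholder names.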
