I have also seen that if one of the users of the cluster writes some buggy code, the workers die... any idea whether these fixes will also help in that scenario?
If you write buggy YARN apps and the code fails on the cluster, the JVMs don't die....

On Jan 23, 2014 3:07 AM, "Sam Bessalah" <[email protected]> wrote:

> Definitely. Thanks. I usually just played around with timeouts before. But this helps. Thx
>
> On Thu, Jan 23, 2014 at 11:56 AM, Guillaume Pitel <[email protected]> wrote:
>
>> Hi sparkers,
>>
>> So I had this problem where my workers were dying or disappearing (and I had to manually kill -9 their processes) often. Sometimes during a computation, sometimes when I Ctrl-C'd the driver, sometimes right at the end of an application execution.
>>
>> It seems that this tuning has solved the problem (in spark-env.sh):
>>
>> export SPARK_DAEMON_JAVA_OPTS="-Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true"
>> export SPARK_JAVA_OPTS="-Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true"
>>
>> Explanation: I've increased the timeouts because the master was missing a heartbeat, thus removing the worker, and after that complaining that an unknown worker was sending heartbeats. I've also set the consolidateFiles option because I noticed that deleting the shuffle files in /tmp/spark-local* was taking forever due to the many files my job created.
>>
>> I also added this to all my programs right after the creation of the SparkContext (sc = sparkContext) to cleanly shut down when cancelling a job:
>>
>> sys.addShutdownHook( { sc.stop() } )
>>
>> Hope this can be useful to someone.
>>
>> Guillaume
>> --
>> Guillaume PITEL, Président
>> +33(0)6 25 48 86 80
>>
>> eXenSa S.A.S. <http://www.exensa.com/>
>> 41, rue Périer - 92120 Montrouge - FRANCE
>> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
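[Editor's note: for readers who want to try the shutdown-hook pattern Guillaume describes, here is a minimal Scala sketch of a driver program. The master URL, app name, object name, and the toy job at the end are placeholders, not part of the original thread.]

    import org.apache.spark.SparkContext

    object CleanShutdownExample {
      def main(args: Array[String]): Unit = {
        // Placeholder master URL and app name; point these at your own cluster.
        val sc = new SparkContext("spark://master:7077", "clean-shutdown-example")

        // Register the hook right after creating the SparkContext, as suggested
        // above, so the context is stopped cleanly even if the driver is
        // cancelled (e.g. with Ctrl-C) and the master is properly notified
        // that the application has finished.
        sys.addShutdownHook {
          sc.stop()
        }

        // ... actual job logic goes here (toy example) ...
        val total = sc.parallelize(1 to 1000).reduce(_ + _)
        println("sum = " + total)
      }
    }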
