Hi sparkers,

So I had this problem where my workers were frequently dying or disappearing (and I had to kill -9 their processes manually). Sometimes it happened during a computation, sometimes when I Ctrl-C'd the driver, sometimes right at the end of an application run.

These settings seem to have solved the problem (in spark-env.sh):
export SPARK_DAEMON_JAVA_OPTS="-Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true"
export SPARK_JAVA_OPTS="-Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true"
Explanation: I increased the timeouts because the master was occasionally missing a heartbeat, removing the worker as a result, and then complaining that an unknown worker was sending heartbeats. I also enabled the consolidateFiles option because I noticed that deleting the shuffle files under /tmp/spark-local* was taking forever, due to the huge number of files my job created.
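For anyone who prefers to keep this out of spark-env.sh: a minimal sketch of setting the same properties programmatically, assuming a pre-1.0 Spark that reads configuration from Java system properties before the SparkContext is created (the object and method names here are mine, not Spark's):

```scala
// Sketch: apply the same tuning from application code instead of spark-env.sh.
// Must run BEFORE the SparkContext is constructed, since old Spark reads
// these system properties at context-creation time.
object SparkTuning {
  def applyTuning(): Unit = {
    System.setProperty("spark.worker.timeout", "600")          // seconds before master drops a worker
    System.setProperty("spark.akka.timeout", "200")            // Akka communication timeout
    System.setProperty("spark.shuffle.consolidateFiles", "true") // fewer shuffle files on disk
  }

  def main(args: Array[String]): Unit = {
    applyTuning()
    // ... a new SparkContext created here would pick these values up
    println(System.getProperty("spark.worker.timeout"))
  }
}
```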

I also added this to all my programs, right after creating the SparkContext (sc), to shut down cleanly when cancelling a job:
sys.addShutdownHook { sc.stop() }
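To show why the hook fires on Ctrl-C, here is a self-contained sketch with a stub standing in for the real SparkContext (StubContext is illustrative, not Spark's API):

```scala
// Stub playing the role of a SparkContext for demonstration purposes.
class StubContext {
  def stop(): Unit = println("context stopped cleanly")
}

object ShutdownHookDemo {
  def main(args: Array[String]): Unit = {
    val sc = new StubContext()
    // sys.addShutdownHook registers a JVM shutdown hook, which runs both on
    // normal exit and on SIGINT (Ctrl-C), so the context is stopped even
    // when the job is cancelled from the terminal.
    sys.addShutdownHook { sc.stop() }
    println("job running")
  }
}
```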
Hope this is useful to someone.

Guillaume
--
eXenSa
Guillaume PITEL, Président
+33(0)6 25 48 86 80

eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05